Scale-Free Content Based Image Retrieval (or Nearly so)
Adrian Daniel Popescu, Alexandra-Lucian Gînsca, H. Borgne
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.42

When textual annotations of Web and social media images are poor or missing, content-based image retrieval (CBIR) is an attractive way to access them. Finding an optimal trade-off between accuracy and scalability for CBIR is challenging in practice. We propose a retrieval method whose complexity is nearly independent of the collection scale and does not degrade result quality. Images are represented with sparse semantic features that can be stored as an inverted index. Search complexity is drastically reduced by (1) considering the query feature dimensions independently, which turns search into a concatenation operation, and (2) pruning the index according to a retrieval objective. To improve precision, the inverted index look-up is complemented with an exhaustive search over a fixed-size list of intermediary results. Experiments on three public collections show that our much faster method slightly outperforms an exhaustive search performed with two competitive baselines.
A Long Short-Term Memory Convolutional Neural Network for First-Person Vision Activity Recognition
Girmaw Abebe, A. Cavallaro
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.159

Temporal information is the main source of discriminating characteristics for the recognition of proprioceptive activities in first-person vision (FPV). In this paper, we propose a motion representation that uses stacked spectrograms. These spectrograms are generated over temporal windows from mean grid-optical-flow vectors and the displacement vectors of the intensity centroid. The stacked representation enables us to use 2D convolutions to learn and extract global motion features. Moreover, we employ a long short-term memory (LSTM) network to recursively encode the temporal dependency among consecutive samples. Experimental results show that the proposed approach achieves state-of-the-art performance on the largest public dataset for FPV activity recognition.
Spatial-Temporal Weighted Pyramid Using Spatial Orthogonal Pooling
Yusuke Mukuta, Y. Ushiku, T. Harada
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.127

Feature pooling is a method that summarizes local descriptors in an image using spatial information. Spatial pyramid matching uses the statistics of local features in image subregions as a global feature. This approach has several drawbacks: there is no theoretical guideline for selecting the pooling regions; robustness to small image translations is lost near the edges of the pooling regions; and the information encoded at different pyramid levels overlaps, so recognition performance stagnates as the pyramid size grows. In this work, we propose a novel interpretation that regards feature pooling as an orthogonal projection in the space of functions mapping the image space to the local feature space. Building on this view, we propose a feature-pooling method that orthogonally projects the function formed by the local descriptors onto the space of low-degree polynomials. We also evaluate the robustness of the proposed method. Experimental results demonstrate the effectiveness of the proposed methods.
Feature-Based Efficient Moving Object Detection for Low-Altitude Aerial Platforms
K. B. Logoglu, Hazal Lezki, M. K. Yucel, A. Ozturk, Alper Kucukkomurler, Batuhan Karagöz, Aykut Erdem, Erkut Erdem
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.248

Moving object detection is one of the integral tasks for aerial reconnaissance and surveillance applications. Despite the problem's rising importance due to the increasing availability of unmanned aerial vehicles, moving object detection suffers from the lack of a widely accepted, correctly labelled dataset that would facilitate a robust evaluation of the techniques published by the community. Towards this end, we compile a new dataset by manually annotating several sequences from the VIVID and UAV123 datasets for moving object detection. We also propose a feature-based, efficient pipeline that is optimized for near real-time performance on GPU-based embedded SoMs (system-on-module). We evaluate our pipeline on this extended dataset for low-altitude moving object detection. The ground-truth annotations are made publicly available to foster further research in the field of moving object detection.
Large-Scale Multimodal Gesture Segmentation and Recognition Based on Convolutional Neural Networks
Huogen Wang, Pichao Wang, Zhanjie Song, W. Li
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.371

This paper presents an effective method for continuous gesture recognition. The method consists of two modules: segmentation and recognition. In the segmentation module, a continuous gesture sequence is segmented into isolated gesture sequences by classifying the frames into gesture frames and transitional frames with two-stream convolutional neural networks. In the recognition module, our method exploits the spatiotemporal information embedded in the RGB and depth sequences. For the depth modality, the sequence is converted into Dynamic Images and Motion Dynamic Images through rank pooling, which are fed to separate convolutional neural networks. For the RGB modality, a convolutional LSTM network learns long-term spatiotemporal features from the short-term spatiotemporal features produced by a 3D convolutional neural network. Our method has been evaluated on the ChaLearn LAP Large-scale Continuous Gesture Dataset and achieves state-of-the-art performance.
Integrating Boundary and Center Correlation Filters for Visual Tracking with Aspect Ratio Variation
Feng Li, Yingjie Yao, P. Li, D. Zhang, W. Zuo, Ming-Hsuan Yang
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.234

Aspect ratio variation frequently appears in visual tracking and severely affects performance. Although many correlation filter (CF)-based trackers have been proposed for scale-adaptive tracking, few studies have addressed aspect ratio variation for CF trackers. In this paper, we make the first attempt to address this issue by introducing a family of 1D boundary CFs that localize the left, right, top, and bottom boundaries in videos. This allows us to cope flexibly with aspect ratio variation during tracking. Specifically, we present a novel tracking model that integrates 1D boundary and 2D center CFs (IBCCF), where near-orthogonality between the boundary and center filters is enforced by a regularization term. To optimize our IBCCF model, we develop an alternating direction method of multipliers. Experiments on several datasets show that IBCCF can effectively handle aspect ratio variation and achieves state-of-the-art performance in terms of accuracy and robustness.
Coarse-to-Fine Deep Kernel Networks
H. Sahbi
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.137

In this paper, we address the issue of efficient computation in deep kernel networks. We propose a novel framework that dramatically reduces the complexity of evaluating these deep kernels. Our method is based on a coarse-to-fine cascade of networks designed for efficient computation: early stages of the cascade are cheap and reject many patterns efficiently, while deep stages are more expensive and accurate. The design principle of these reduced-complexity networks is a variant of the cross-entropy criterion that reduces the complexity of the networks in the cascade while preserving all the positive responses of the original kernel network. Experiments conducted on the challenging and time-demanding task of change detection in very large satellite images show that the proposed coarse-to-fine approach is both effective and highly efficient.
Fusing Image and Segmentation Cues for Skeleton Extraction in the Wild
Xiaolong Liu, Pengyuan Lyu, X. Bai, Ming-Ming Cheng
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.205

Extracting skeletons from natural images is a challenging problem due to complex backgrounds and the varying scales of objects. To address this problem, we propose a two-stream fully convolutional neural network that takes the original image and its corresponding semantic segmentation probability map as inputs and predicts the skeleton map from merged multi-scale features. We find that the semantic segmentation probability map is complementary to the color image and boosts the performance of our baseline model, which was trained only on color images. On the SK-LARGE validation set, our method reaches an F-measure of 0.738, significantly outperforming the current state of the art and demonstrating the effectiveness of the proposed approach.
Dress Like a Star: Retrieving Fashion Products from Videos
Noa García, George Vogiatzis
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.270

This work proposes a system for retrieving clothing and fashion products from video content. Although films and television are the perfect showcase for fashion brands to promote their products, spectators are not always aware of where to buy the latest trends they see on screen. Here, we present a framework for bridging the gap between the fashion products shown in videos and the users who want them. By relating clothing items and video frames in an indexed database and performing frame retrieval with temporal aggregation and fast indexing techniques, fashion products can be found from videos in a simple and non-intrusive way. Experiments on a large-scale dataset show that, with the proposed framework, memory requirements are reduced by 42.5x with respect to linear search, while accuracy is maintained at around 90%.
Compact Color Texture Descriptor Based on Rank Transform and Product Ordering in the RGB Color Space
Antonio Fernández, David Lima, F. Bianconi, F. Smeraldi
2017 IEEE International Conference on Computer Vision Workshops (ICCVW), October 2017. DOI: 10.1109/ICCVW.2017.126

Color information is generally considered useful for texture analysis. However, an important category of highly effective texture descriptors, namely rank features, has no obvious extension to color spaces, on which no canonical order is defined. In this work, we explore the use of partial orders in conjunction with rank features. We introduce a rank transform based on product ordering, which generalizes the classic rank transform to the RGB space by a combined tally of dominated and non-comparable pixels. Experimental results on nine heterogeneous standard databases confirm that our approach outperforms the standard rank transform and its extensions to lexicographic and bit-mixing total orders, as well as preorders based on the Euclidean distance to a reference color. The low computational complexity and compact codebook size of the transform make it suitable for multi-scale approaches.