Complete visual metrology using relative affine structure
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776265
Adersh Miglani, Sumantra Dutta Roy, S. Chaudhury, J. B. Srivastava
We propose a framework for retrieving metric information for repeated objects from a single perspective image. Relative affine structure, an invariant, is directly proportional to the Euclidean distance of a three-dimensional point from a reference plane; the proposed method builds on this fundamental concept. The first object undergoes a 4 × 4 transformation that produces the repeated object, and we represent this transformation in terms of three relative affine structures along the X, Y and Z axes. Additionally, we propose a possible extension of this framework to motion analysis: structure from motion and motion segmentation.
{"title":"Complete visual metrology using relative affine structure","authors":"Adersh Miglani, Sumantra Dutta Roy, S. Chaudhury, J. B. Srivastava","doi":"10.1109/NCVPRIPG.2013.6776265","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776265","url":null,"abstract":"We propose a framework for retrieving metric information for repeated objects from single perspective image. Relative affine structure, which is an invariant, is directly proportional to the Euclidean distance of a three dimensional point from a reference plane. The proposed method is based on this fundamental concept. The first object undergoes 4 × 4 transformation and results in a repeated object. We represent this transformation in terms of three relative affine structures along X, Y and Z axes. Additionally, we propose the possible extension of this framework for motion analysis - structure from motion and motion segmentation.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132087787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive BPSO based feature selection and skin detection based background removal for enhanced face recognition
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776226
Mayukh Sattiraju, Vikram Manikandan M, K. Manikantan, S. Ramachandran
Face recognition under varying background and pose is challenging, and extracting background- and pose-invariant features is an effective way to address this problem. This paper proposes a skin detection-based approach for enhancing the performance of a Face Recognition (FR) system, employing a unique combination of skin-based background removal, the Discrete Wavelet Transform (DWT), Adaptive Multi-Level Threshold Binary Particle Swarm Optimization (ABPSO) and an Error Control Feedback (ECF) loop. Skin-based background removal efficiently isolates the face region, ABPSO-based feature selection searches the feature space for the optimal feature subset, and the ECF loop neutralizes pose variations. Experimental results, obtained by applying the proposed algorithm to the Color FERET and CMU PIE face databases, show that the proposed system outperforms other FR systems, with a significant increase in the recognition rate and a substantial reduction in the number of features.
{"title":"Adaptive BPSO based feature selection and skin detection based background removal for enhanced face recognition","authors":"Mayukh Sattiraju Student, Vikram Manikandan M Student, K. Manikantan, Associate Professor, S. Ramachandran","doi":"10.1109/NCVPRIPG.2013.6776226","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776226","url":null,"abstract":"Face recognition under varying background and pose is challenging, and extracting background and pose invariant features is an effective approach to solve this problem. This paper proposes a skin detection-based approach for enhancing the performance of a Face Recognition (FR) system, employing a unique combination of Skin based background removal, Discrete Wavelet Transform (DWT), Adaptive Multi-Level Threshold Binary Particle Swarm Optimization (ABPSO) and an Error Control Feedback (ECF) loop. Skin based background removal is used for efficient background removal and ABPSO-based feature selection algorithm is used to search the feature space for the optimal feature subset. The ECF loop is used to neutralize pose variations. Experimental results, obtained by applying the proposed algorithm on Color FERET and CMUPIE face databases, show that the proposed system outperforms other FR systems. A significant increase in the recognition rate and substantial reduction in the number of features are observed.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126617337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
STAR: A Content Based Video Retrieval system for moving camera video shots
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776267
C. Chattopadhyay, Sukhendu Das
This paper presents the design of STAR (Spatio-Temporal Analysis and Retrieval), an unsupervised Content Based Video Retrieval (CBVR) system. STAR's key insight and primary contribution is that it models video content using a joint spatio-temporal feature representation and retrieves videos from the database that have similar moving objects and motion trajectories. Foreground moving blobs are extracted from a moving-camera video shot, along with a trajectory for camera motion compensation, to form the space-time volume (STV). The STV is processed to obtain the EMST-CSS representation, which can discriminate across different categories of videos. The performance of STAR is evaluated qualitatively and quantitatively using the precision-recall metric on benchmark video datasets with unconstrained video shots, demonstrating its efficiency.
{"title":"STAR: A Content Based Video Retrieval system for moving camera video shots","authors":"C. Chattopadhyay, Sukhendu Das","doi":"10.1109/NCVPRIPG.2013.6776267","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776267","url":null,"abstract":"This paper presents the design of STAR (Spatio-Temporal Analysis and Retrieval), an unsupervised Content Based Video Retrieval (CBVR) System. STAR's key insight and primary contribution is that it models video content using a joint spatio-temporal feature representation and retrieves videos from the database which have similar moving object and trajectories of motion. Foreground moving blobs from a moving camera video shot are extracted, along with a trajectory for camera motion compensation, to form the space-time volume (STV). The STV is processed to obtain the EMST-CSS representation, which can discriminate across different categories of videos. Performance of STAR has been evaluated qualitatively and quantitatively using precision-recall metric on benchmark video datasets having unconstrained video shots, to exhibit efficiency of STAR.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125124271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Word recognition in natural scene and video images using Hidden Markov Model
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776157
Sangheeta Roy, P. Roy, P. Shivakumara, U. Pal
Text recognition in natural scene and video images is more challenging than in scanned document images, owing to variations in text style, font, font size and background across sources. Approaches exist for segmenting words from video and scene images and feeding the word images to OCR engines; nevertheless, such methods often fail to yield satisfactory recognition results. Therefore, in this paper, we propose to combine a Hidden Markov Model (HMM) and a Convolutional Neural Network (CNN) to achieve a good recognition rate. Sequential gradient features with the HMM help to find the character alignment of a word, and the character alignments are then verified by the CNN. The approach is tested on both video and scene data to show its effectiveness, and the results are encouraging.
{"title":"Word recognition in natural scene and video images using Hidden Markov Model","authors":"Sangheeta Roy, P. Roy, P. Shivakumara, U. Pal","doi":"10.1109/NCVPRIPG.2013.6776157","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776157","url":null,"abstract":"Text recognition from a natural scene and video is challenging compared to that in scanned document images. This is due to the problems of text on different sources of various styles, font variation, font size variations, background variations, etc. There are approaches for word segmentation from video and scene images to feed the word image into OCRs. Nevertheless, such methods often fail to yield satisfactory results in recognition. Therefore, in this paper, we propose to combine Hidden Markov Model (HMM) and Convolutional Neural Network (CNN) to achieve good recognition rate. Sequential gradient features with HMM help to find character alignment of a word. Later the character alignments are verified by Convolutional Neural network (CNN). The approach is tested on both video and scene data to show the effectiveness of the proposed approach. The results are found encouraging.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115368154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monitoring a large surveillance space through distributed face matching
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776185
Richa Mishra, Prasanna Kumar, S. Chaudhury, I. Sreedevi
A large space with many cameras requires huge storage and computational power to process the resulting data for surveillance applications. In this paper we propose a distributed camera-and-processing based face detection and recognition system that can generate information for finding spatio-temporal movement patterns of individuals over a large monitored space. The system is built on the Hadoop Distributed File System using the MapReduce programming model. A novel key generation scheme using a distance-based hashing technique is used to distribute the face matching task. Experimental results establish the effectiveness of the technique.
{"title":"Monitoring a large surveillance space through distributed face matching","authors":"Richa Mishra, Prasanna Kumar, S. Chaudhury, I. Sreedevi","doi":"10.1109/NCVPRIPG.2013.6776185","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776185","url":null,"abstract":"Large space with many cameras require huge storage and computational power to process these data for surveillance applications. In this paper we propose a distributed camera and processing based face detection and recognition system which can generate information for finding spatiotemporal movement pattern of individuals over a large monitored space. The system is built upon Hadoop Distributed File System using map reduce programming model. A novel key generation scheme using distance based hashing technique has been used for distribution of the face matching task. Experimental results have established effectiveness of the technique.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114530746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Augmented paper system: A framework for User's Personalized Workspace
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776182
Kavita Bhardwaj, S. Chaudhury, Sumantra Dutta Roy
In this paper, we present a framework for a “User's Personalized Workspace” that augments physical paper with digital documents. Paper-based interactions are seamlessly integrated with digital-document interactions for reading as an activity; when a user is engaged in reading, writing becomes complementary. In academic settings, paper-based presentation has long supported such exercises. Besides rendering annotations on the digital document and storing them in a database, content that the user encircles or underlines on paper is used to hyperlink the document. The main objective of this work is to synchronize a physical paper and its digital version seamlessly from the user's perspective. We also compare our proposed system with existing systems, which focus on one activity or the other.
{"title":"Augmented paper system: A framework for User's Personalized Workspace","authors":"Kavita Bhardwaj, S. Chaudhury, Sumantra Dutta Roy","doi":"10.1109/NCVPRIPG.2013.6776182","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776182","url":null,"abstract":"In this paper, we are presenting a framework for “User's Personalized Workspace” by augmenting the physical paper and digital document. The paper based interactions are seamlessly integrated with digital document based interactions for reading as a activity. For instance when user is involved in reading activity, writing becomes complimentary. In a academic system, paper based presentation mode has facilitated such exercises. Despite rendering the annotation on digital document and store it onto the database, the content of the paper encircled or underlined is used to hyperlink the document. Synchronizing a physical paper and those of digital version in seamless fashion from a user's perspective is the main objective of this work. We have also compared the existing systems which focus on one activity or the other in our proposed system.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128852162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cursive stroke sequencing for handwritten text documents recognition
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776232
S. Panwar, N. Nain
Text segmentation is the process of splitting an image of a handwritten text document into pieces corresponding to individual lines, words and characters. This is very challenging because curved text lines appear frequently in handwritten documents, with varying skew and slant angles. After segmenting words or strokes, i.e., finding the connected components in the handwritten document, the strokes must be sequenced according to the document so that its meaning is preserved. In this paper, we use a bottom-up grouping approach for segmentation. A novel connectivity strength parameter, combined with a depth-first search, extracts the connected components of the same line from the complete set of connected components of the given document. The exact sequence of connected components is stored in a sequential vector containing the component labels. The proposed cursive stroke sequencing technique is implemented and tested on the benchmark IAM database with encouraging results. Quantitative analysis also shows that this approach outperforms existing segmentation techniques and overcomes the problems encountered with hill-and-dale writing styles and overlapped and touching lines. The accuracy of the proposed sequencing technique is 98%.
{"title":"Cursive stroke sequencing for handwritten text documents recognition","authors":"S. Panwar, N. Nain","doi":"10.1109/NCVPRIPG.2013.6776232","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776232","url":null,"abstract":"Text segmentation can be defined as the process of splitting the images of handwritten text document into pieces corresponding to single lines, words and character. This is a very challenging task because in handwritten documents curved text lines appear frequently with different skew and slant angles. After segmentation of word or stroke, also defined as finding the connected components in handwritten text document, we have to sequence the strokes according to the document so that the meaning of the document is preserved. In this paper, We use bottom up grouping approach for segmentation. We have used a novel connectivity strength parameter with depth first search approach for extraction of connected components of the same line from complete connected components of the given document. The exact sequence of connected components is stored in the sequential vector which contains the label of the components. The proposed cursive stroke sequencing technique is implemented and tested on a benchmark IAM database providing encouraging results. Quantitative analysis also shows that this approach gives better results compared to existing segmentation techniques and overcomes the problems encountered in Hill-and-dale writing styles and overlapped and touched lines. The accuracy of the proposed sequencing technique is 98%.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"700 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122986296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Digital image tampering detection and localization using singular value decomposition technique
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776160
V. Mall, A. Roy, S. Mitra
Recent years have witnessed exponential growth in the use of digital images, driven by high-quality digital cameras and multimedia technology. The easy availability of image editing software has made digital image manipulation very common: ready-to-use software available on the Internet can easily be used to manipulate images. In such an environment, the integrity of an image cannot be taken for granted. Malicious tampering has serious implications for legal documents, copyright issues and forensic cases, and researchers have proposed a large number of methods to detect it. The proposed method is based on a hash generation technique using singular value decomposition. The efficient hash vector proposed here helps to detect and localize image tampering. The method is shown to be robust against content-preserving manipulation but extremely sensitive to even very minute structural tampering.
{"title":"Digital image tampering detection and localization using singular value decomposition technique","authors":"V. Mall, A. Roy, S. Mitra","doi":"10.1109/NCVPRIPG.2013.6776160","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776160","url":null,"abstract":"Recent years have witnessed an exponential growth in the use of digital images due to development of high quality digital cameras and multimedia technology. Easy availability of image editing software has made digital image processing very popular. Ready to use software are available on internet which can be easily used to manipulate the images. In such an environment, the integrity of the image can not be taken for granted. Malicious tampering has serious implication for legal documents, copyright issues and forensic cases. Researchers have come forward with large number of methods to detect image tampering. The proposed method is based on hash generation technique using singular value decomposition. Design of an efficient hash vector as proposed will help in detection and localization of image tampering. The proposed method shows that it is robust against content preserving manipulation but extremely sensitive to even very minute structural tampering.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"224 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124468915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporally scalable compression of animation geometry
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776263
Sanjib Das, H. ShahJaimeen, P. Bora
Animation geometry compression involves compressing the geometry data of the dynamic three-dimensional (3D) triangular meshes that represent the animation frames. The scalability issue in geometry compression concerns compressing the geometry at a single scale and decompressing it at multiple scales. One algorithm for animation geometry compression employs skinning-based motion prediction of vertices and a temporal wavelet transform (TWT) on the prediction errors. This paper presents an encoder and a decoder structure for a temporally scalable implementation of that algorithm. The frame-wise prediction errors arising from motion-based clustering of groups of affine-transformed vertices are converted into a layered structure over the frames using the TWT. The affine transformation data of the vertices, the weights corresponding to each cluster of vertices, and the wavelet coefficients of the prediction errors are quantized and entropy coded. The resulting bit-stream is arranged in a layered structure to achieve temporal scalability. The base layer consists of the connectivity-coded first frame, the indices of the vertex clusters, the weights corresponding to each cluster of a vertex, the approximation sub-band of the prediction error, and the affine transformations corresponding to the approximation frames. The enhancement layers consist of the detail sub-bands of the prediction error and the affine transformations corresponding to the detail frames. The scalable encoder and decoder are tested on standard animation sequences, and the experimental results show good performance in terms of scalable rates and distortions.
{"title":"Temporally scalable compression of animation geometry","authors":"Sanjib Das, H. ShahJaimeen, P. Bora","doi":"10.1109/NCVPRIPG.2013.6776263","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776263","url":null,"abstract":"Animation geometry compression involves compressing the geometry data of dynamic three-dimensional (3D) triangular meshes representing the animation frames. The scalability issue of geometry compression addresses compressing the geometry in a single scale and decompressing it in multiple scales. One of the algorithms for animation geometry compression employs the skinning based motion prediction of vertices and the temporal wavelet transform (TWT) on the prediction errors. This paper presents an encoder and a decoder structure for achieving temporally scalable implementation of the algorithm. The frame-wise prediction errors due to motion based clustering of a group of affine transformed vertices are converted into a layered structure of the frames using the TWT. The affine transformation data of vertices, weights corresponding to each cluster of vertices and the wavelet coefficients of the prediction errors are quantized and encoded using the entropy coding. The resulting bit-stream is arranged in a layered structure to achieve temporal scalability. The base layer consists of the connectivity coded first frame, indices of the clusters of vertices, weights corresponding to each cluster of a vertex, the approximation sub-band of prediction error and the affine transformations corresponding to the approximation frames. The enhancement layers consist of the detailed sub-bands of prediction error and the affine transformations corresponding to the detailed frames. The scalable encoder and decoder are tested on some standard animation sequences and the experimental results show good performance in terms of scalable rates and distortions.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"321 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121681621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatio-temporal feature based VLAD for efficient video retrieval
Pub Date: 2013-12-01 | DOI: 10.1109/NCVPRIPG.2013.6776268
M. K. Reddy, Sahil Arora, R. Venkatesh Babu
Compact representation of visual content has emerged as an important topic in the context of large-scale image and video retrieval, and the recently proposed Vector of Locally Aggregated Descriptors (VLAD) has been shown to outperform other existing retrieval techniques. In this paper, we propose two spatio-temporal features for constructing VLAD vectors for large-scale video retrieval: i) the Local Histogram of Oriented Optical Flow (LHOOF), and ii) Space-Time Invariant Points (STIP). Given a query video, our aim is to retrieve similar videos from the database. Experiments are conducted on the UCF50 and HMDB51 datasets, which pose challenges in the form of camera motion, viewpoint variation, large intra-class variation, etc. The performance of the proposed features is compared with a SIFT-based spatial feature; the mean average precision (MAP) indicates better retrieval performance for the proposed spatio-temporal features than for the spatial feature.
{"title":"Spatio-temporal feature based VLAD for efficient video retrieval","authors":"M. K. Reddy, Sahil Arora, R. Venkatesh Babu","doi":"10.1109/NCVPRIPG.2013.6776268","DOIUrl":"https://doi.org/10.1109/NCVPRIPG.2013.6776268","url":null,"abstract":"Compact representation of visual content has emerged as an important topic in the context of large scale image/video retrieval. The recently proposed Vector of Locally Aggregated Descriptors (VLAD) has shown to outperform other existing techniques for retrieval. In this paper, we propose two spatio-temporal features for constructing VLAD vectors for videos in the context of large scale video retrieval. Given a particular query video, our aim is to retrieve similar videos from the database. Experiments are conducted on UCF50 and HMDB51 datasets, which pose challenges in the form of camera motion, view-point variation, large intra-class variation, etc. The paper proposes the following two spatio-temporal features for constructing VLADs i) Local Histogram of Oriented Optical Flow (LHOOF), and ii) Space-Time Invariant Points (STIP). The performance of these proposed features are compared with SIFT based spatial feature. The mean average precision (MAP) indicates the better retrieval performance of the proposed spatio-temporal feature over spatial feature.","PeriodicalId":436402,"journal":{"name":"2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134037093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}