Neural networks designed for real-time object detection have recently improved significantly, but in practice, looking at only a single RGB image at a time may not be ideal. For example, when detecting objects in videos, a foreground detection algorithm can be used to obtain compact temporal data, which can be fed into a neural network alongside the RGB images. We propose an approach for doing this, based on an existing object detector, which re-uses pretrained weights for the processing of RGB images. The network was tested on the VIRAT dataset with object detection annotations, a problem this approach is well suited for. Accuracy improved significantly (by up to 66%), at the cost of a roughly 40% increase in computation time.
{"title":"Improving a Real-Time Object Detector with Compact Temporal Information","authors":"Martin Ahrnbom, M. B. Jensen, Kalle Åström, M. Nilsson, H. Ardö, T. Moeslund","doi":"10.1109/ICCVW.2017.31","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.31","url":null,"abstract":"Neural networks designed for real-time object detection have recently improved significantly, but in practice, looking at only a single RGB image at the time may not be ideal. For example, when detecting objects in videos, a foreground detection algorithm can be used to obtain compact temporal data, which can be fed into a neural network alongside RGB images. We propose an approach for doing this, based on an existing object detector, that re-uses pretrained weights for the processing of RGB images. The neural network was tested on the VIRAT dataset with annotations for object detection, a problem this approach is well suited for. The accuracy was found to improve significantly (up to 66%), with a roughly 40% increase in computational time.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130626561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nowadays, more and more people buy products via e-commerce websites. We can not only compare prices from different online retailers but also obtain useful review comments from other customers. In particular, people tend to search for visually similar products when looking for possible candidates, so the need for visual product search is growing. To tackle this problem, recent works integrate additional information (e.g., attributes, image pairs, categories) with deep convolutional neural networks (CNNs) to solve cross-domain image retrieval and product search. Building on state-of-the-art approaches, we propose rank-based candidate selection for feature learning. Given a query image, we push hard negative (irrelevant) images away from the query and pull ambiguous positive (relevant) images closer to it. We investigate the effects of global and attention-based local features on the proposed method, and achieve a 15.8% relative gain for product search.
{"title":"Feature Learning with Rank-Based Candidate Selection for Product Search","authors":"Y. Kuo, Winston H. Hsu","doi":"10.1109/ICCVW.2017.44","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.44","url":null,"abstract":"Nowadays, more and more people buy products via e-commerce websites. We can not only compare prices from different online retailers but also obtain useful review comments from other customers. Especially, people tend to search for visually similar products when they are looking for possible candidates. The need for product search is emerging. To tackle the problem, recent works integrate different additional information (e.g., attributes, image pairs, category) with deep convolutional neural networks (CNNs) for solving cross-domain image retrieval and product search. Based on the state-of-the-art approaches, we propose a rank-based candidate selection for feature learning. Given a query image, we attempt to push hard negative (irrelevant) images away from queries and make ambiguous positive (relevant) images close to queries. We investigate the effects of global and attention-based local features on the proposed method, and achieve 15.8% relative gain for product search.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125551193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Person re-identification (re-id) has drawn significant attention over the last decade. The design of view-invariant feature descriptors is one of the most crucial problems for this task. Covariance descriptors have often been used in person re-id because of their invariance properties. More recently, a new state-of-the-art performance was achieved by also including first-order moments and two-level Gaussian descriptors. However, second-order or lower moments may not be enough when the feature distribution is not Gaussian. In this paper, we address this limitation by using the empirical (symmetric positive definite) moment matrix to incorporate higher-order moments and by applying the on-manifold mean to pool the features along horizontal strips. The new descriptor, based on the on-manifold mean of a moment matrix (moM), can approximate more complex, non-Gaussian distributions of the pixel features within a mid-sized local patch. We have evaluated the proposed feature on five widely used re-id datasets. The experiments show that the moM and hierarchical Gaussian descriptor (GOG) [30] features complement each other, and that combining both features achieves performance comparable to state-of-the-art methods.
{"title":"moM: Mean of Moments Feature for Person Re-identification","authors":"Mengran Gou, O. Camps, M. Sznaier","doi":"10.1109/ICCVW.2017.154","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.154","url":null,"abstract":"Person re-identification (re-id) has drawn significant attention in the recent decade. The design of view-invariant feature descriptors is one of the most crucial problems for this task. Covariance descriptors have often been used in person re-id because of their invariance properties. More recently, a new state-of-the-art performance was achieved by also including first-order moment and two-level Gaussian descriptors. However, using second-order or lower moments information might not be enough when the feature distribution is not Gaussian. In this paper, we address this limitation, by using the empirical (symmetric positive definite) moment matrix to incorporate higher order moments and by applying the on-manifold mean to pool the features along horizontal strips. The new descriptor, based on the on-manifold mean of a moment matrix (moM), can be used to approximate more complex, non-Gaussian, distributions of the pixel features within a mid-sized local patch. We have evaluated the proposed feature on five widely used re-id datasets. The experiments show that the moM and hierarchical Gaussian descriptor (GOG) [30] features complement each other and that using a combination of both features achieves a comparable performance with the state-of-the-art methods.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"423 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126714336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, an efficient spotting-recognition framework is proposed to tackle large-scale continuous gesture recognition from RGB-D input. Concretely, continuous gestures are first segmented into isolated gestures based on accurate hand positions obtained by a two-stream Faster R-CNN hand detector. In the subsequent recognition stage, a hand-oriented spatiotemporal (ST) feature is extracted for each isolated gesture video by a 3D convolutional network (C3D). This feature considers only the hand regions and the face location, which effectively suppresses the negative influence of distractors such as the background, clothing, and the rest of the body. Next, the features extracted from the calibrated RGB and depth channels are fused to boost their representative power, and the final classification is performed with a simple linear SVM. Extensive experiments are conducted on the validation and test sets of the Continuous Gesture Dataset (ConGD) to validate the effectiveness of the proposed framework. Our method achieves promising performance, with a mean Jaccard Index of 0.6103, and outperforms the other entries in the ChaLearn LAP Large-scale Continuous Gesture Recognition Challenge.
{"title":"Continuous Gesture Recognition with Hand-Oriented Spatiotemporal Feature","authors":"Zhipeng Liu, Xiujuan Chai, Zhuang Liu, Xilin Chen","doi":"10.1109/ICCVW.2017.361","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.361","url":null,"abstract":"In this paper, an efficient spotting-recognition framework is proposed to tackle the large scale continuous gesture recognition problem with the RGB-D data input. Concretely, continuous gestures are firstly segmented into isolated gestures based on the accurate hand positions obtained by two streams Faster R-CNN hand detector In the subsequent recognition stage, firstly, towards the gesture representation, a specific hand-oriented spatiotemporal (ST) feature is extracted for each isolated gesture video by 3D convolutional network (C3D). In this feature, only the hand regions and face location are considered, which can effectively block the negative influence of the distractors, such as the background, cloth and the body and so on. Next, the extracted features from calibrated RGB and depth channels are fused to boost the representative power and the final classification is achieved by using the simple linear SVM. Extensive experiments are conducted on the validation and testing sets of the Continuous Gesture Datasets (ConGD) to validate the effectiveness of the proposed recognition framework. Our method achieves the promising performance with the mean Jaccard Index of 0.6103 and outperforms other results in the ChaLearn LAP Large-scale Continuous Gesture Recognition Challenge.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126838765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The analysis of dynamic scenes in video is a very useful task, especially for the detection and monitoring of natural hazards such as floods and fires. In this work, we focus on the challenging problem of real-world dynamic scene understanding, where videos contain dynamic textures recorded "in the wild". These videos feature large illumination variations, complex motion, occlusions, and camera motion, as well as significant intra-class differences, since the motion patterns of dynamic textures of the same category may vary greatly in real-world recordings. We address these issues by introducing a novel dynamic texture descriptor, "Local Binary Pattern-flow" (LBP-flow), which accurately classifies dynamic scenes whose complex motion patterns are difficult to separate using existing local descriptors or to model with statistical techniques. LBP-flow builds upon existing Local Binary Pattern (LBP) descriptors by providing a low-cost representation of both appearance and optical flow textures, increasing its representational power. The descriptor statistics are encoded with the Fisher vector, an informative mid-level representation, followed by a neural network that reduces the dimensionality and increases the discriminability of the encoded descriptor. The resulting spatio-temporal descriptor is highly accurate at a very low computational cost, enabling deployment in real-world surveillance and security applications. Experiments on challenging benchmark datasets demonstrate that it surpasses state-of-the-art dynamic texture descriptors in recognition accuracy.
{"title":"LBP-Flow and Hybrid Encoding for Real-Time Water and Fire Classification","authors":"Konstantinos Avgerinakis, Panagiotis Giannakeris, A. Briassouli, A. Karakostas, S. Vrochidis, Y. Kompatsiaris","doi":"10.1109/ICCVW.2017.56","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.56","url":null,"abstract":"The analysis of dynamic scenes in video is a very useful task especially for the detection and monitoring of natural hazards such as floods and fires. In this work, we focus on the challenging problem of real-world dynamic scene understanding, where videos contain dynamic textures that have been recorded in the \"wild\". These videos feature large illumination variations, complex motion, occlusions, camera motion, as well as significant intra-class differences, as the motion patterns of dynamic textures of the same category may be subject to large variations in real world recordings. We address these issues by introducing a novel dynamic texture descriptor, the \"Local Binary Pattern-flow\" (LBP-flow), which is shown to be able to accurately classify dynamic scenes whose complex motion patterns are difficult to separate using existing local descriptors, or which cannot be modelled by statistical techniques. LBP-flow builds upon existing Local Binary Pattern (LBP) descriptors by providing a low-cost representation of both appearance and optical flow textures, to increase its representation capabilities. The descriptor statistics are encoded with the Fisher vector, an informative mid-level descriptor, while a neural network follows to reduce the dimensionality and increase the discriminability of the encoded descriptor. The proposed algorithm leads to a highly accurate spatio-temporal descriptor which achieves a very low computational cost, enabling the deployment of our descriptor in real world surveillance and security applications. Experiments on challenging benchmark datasets demonstrate that it achieves recognition accuracy results that surpass State-of-the-Art dynamic texture descriptors.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"180 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123183593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic transcription is a well-known task in the music information retrieval (MIR) domain and consists in computing a symbolic music representation (e.g., MIDI) from an audio recording. In this work, we address the automatic transcription of video recordings when the audio modality is missing or of insufficient quality, and therefore rely on the visual information. We focus on the clarinet, which is played by opening and closing a set of holes and keys. We propose a method for automatic visual note estimation that detects the fingertips of the player and measures their displacement with respect to the holes and keys of the clarinet. To this aim, we track the clarinet and determine its position in every frame. The relative positions of the fingertips are used as features for a machine learning algorithm trained for note pitch classification. For that purpose, a dataset is built in a semi-automatic way by estimating pitch information from the audio signals of an existing collection of 4.5 hours of video recordings of six different songs performed by nine different players. Our results confirm the difficulty of visual, as opposed to audio-based, automatic transcription, mainly due to motion blur and occlusions that cannot be resolved from a single view.
{"title":"Visual Music Transcription of Clarinet Video Recordings Trained with Audio-Based Labelled Data","authors":"E. Gómez, P. Arias, Pablo Zinemanas, G. Haro","doi":"10.1109/ICCVW.2017.62","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.62","url":null,"abstract":"Automatic transcription is a well-known task in the music information retrieval (MIR) domain, and consists on the computation of a symbolic music representation (e.g. MIDI) from an audio recording. In this work, we address the automatic transcription of video recordings when the audio modality is missing or it does not have enough quality, and thus analyze the visual information. We focus on the clarinet which is played by opening/closing a set of holes and keys. We propose a method for automatic visual note estimation by detecting the fingertips of the player and measuring their displacement with respect to the holes and keys of the clarinet. To this aim, we track the clarinet and determine its position on every frame. The relative positions of the fingertips are used as features of a machine learning algorithm trained for note pitch classification. For that purpose, a dataset is built in a semiautomatic way by estimating pitch information from audio signals in an existing collection of 4.5 hours of video recordings from six different songs performed by nine different players. Our results confirm the difficulty of performing visual vs audio automatic transcription mainly due to motion blur and occlusions that cannot be solved with a single view.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126579325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic vectorization of fashion hand-drawn sketches is a crucial task performed by fashion industries to speed up their workflows. Vectorizing hand-drawn sketches is not easy, and it requires a crucial first step: extracting precise, thin lines from sketches that are potentially very diverse (depending on the tool used and on the designer's capabilities and preferences). This paper proposes a system for automatic vectorization of fashion hand-drawn sketches based on Pearson's correlation coefficient with multiple Gaussian kernels, used to enhance and extract curvilinear structures in a sketch. The use of correlation grants invariance to image contrast and lighting, making the extracted lines more reliable for vectorization. Moreover, the proposed algorithm is designed to extract thin and wide lines equally well, even with varying stroke hardness, which is common in fashion hand-drawn sketches. It also handles crossing lines and adjacent parallel lines, and needs very few parameters (if any) to run. The efficacy of the proposal has been demonstrated on both hand-drawn sketches and images with added artificial noise, showing in both cases excellent performance with respect to the state of the art.
{"title":"An Accurate System for Fashion Hand-Drawn Sketches Vectorization","authors":"Luca Donati, Simone Cesano, A. Prati","doi":"10.1109/ICCVW.2017.268","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.268","url":null,"abstract":"Automatic vectorization of fashion hand-drawn sketches is a crucial task performed by fashion industries to speed up their workflows. Performing vectorization on hand-drawn sketches is not an easy task, and it requires a first crucial step that consists in extracting precise and thin lines from sketches that are potentially very diverse (depending on the tool used and on the designer capabilities and preferences). This paper proposes a system for automatic vectorization of fashion hand-drawn sketches based on Pearson's Correlation Coefficient with multiple Gaussian kernels in order to enhance and extract curvilinear structures in a sketch. The use of correlation grants invariance to image contrast and lighting, making the extracted lines more reliable for vectorization. Moreover, the proposed algorithm has been designed to equally extract both thin and wide lines with changing stroke hardness, which are common in fashion hand-drawn sketches. It also works for crossing lines, adjacent parallel lines and needs very few parameters (if any) to run. The efficacy of the proposal has been demonstrated on both hand-drawn sketches and images with added artificial noise, showing in both cases excellent performance w.r.t. the state of the art.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126034190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a cognitively inspired change detection method for detecting and localizing shape variations in point clouds. A well-defined, coarse-to-fine pipeline is introduced: i) shape segmentation, and ii) fine segment registration using attention blocks. Shape segmentation is obtained with a covariance-based method, and fine segment registration is carried out with a gravitational registration algorithm. In particular, this partition-based approach, driven by a visual attention mechanism, speeds up deformation detection and localization. Results are shown on synthetic data of house and aircraft models. Experimental results show that this simple yet effective approach, designed with an eye to scalability, detects and localizes deformations faster. A real-world car use case is also presented, with promising preliminary results relevant to auditing and insurance claim tasks.
{"title":"Can We Speed up 3D Scanning? A Cognitive and Geometric Analysis","authors":"Karthikeyan Vaiapury, B. Purushothaman, A. Pal, Swapna Agarwal","doi":"10.1109/ICCVW.2017.317","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.317","url":null,"abstract":"The paper propose a cognitive inspired change detection method for the detection and localization of shape variations on point clouds. A well defined pipeline is introduced by proposing a coarse to fine approach: i) shape segmentation, ii) fine segment registration using attention blocks. Shape segmentation is obtained using covariance based method and fine segment registration is carried out using gravitational registration algorithm. In particular the introduction of this partition-based approach using visual attention mechanism improves the speed of deformation detection and localization. Some results are shown on synthetic data of house and aircraft models. Experimental results shows that this simple yet effective approach designed with an eye to scalability can detect and localize the deformation in a faster manner. A real world car use case is also presented with some preliminary promising results useful for auditing and insurance claim tasks.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123848713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structured light 3D surface scanners usually consist of one projector and one camera, which provide only a limited view of the object's surface. Multiple projectors and cameras must be used to reconstruct the whole surface profile. Using multiple projectors in structured light profilometry is challenging due to inter-projector interference, which makes pattern separation difficult. We propose the use of sinusoidal fringe patterns where each projector has its own specifically chosen set of temporal phase shifts, which together form a DFT basis of size 2P+1, where P is the number of projectors. This choice enables simple and efficient separation of the projected patterns. The proposed method imposes no limit on the number of projectors or on their placement. We demonstrate its applicability on a structured light system with three projectors and six cameras for human body scanning.
{"title":"Efficient Separation Between Projected Patterns for Multiple Projector 3D People Scanning","authors":"T. Petković, T. Pribanić, M. Donlic, P. Sturm","doi":"10.1109/ICCVW.2017.101","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.101","url":null,"abstract":"Structured light 3D surface scanners are usually comprised of one projector and of one camera which provide a limited view of the object's surface. Multiple projectors and cameras must be used to reconstruct the whole surface profile. Using multiple projectors in structured light profilometry is a challenging problem due to inter-projector interferences which make pattern separation difficult. We propose the use of sinusoidal fringe patterns where each projector has its own specifically chosen set of temporal phase shifts which together comprise a DFT2P+1 basis, where P is the number of projectors. Such a choice enables simple and efficient separation between projected patterns. The proposed method does not impose a limit on the number of projectors used and does not impose a limit on the projector placement. We demonstrate the applicability of the proposed method on three projectors and six cameras structured light system for human body scanning.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121363028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gait has been investigated as a biometric feature for human identification and other biometric applications. However, gait is highly dependent on the viewing angle, so existing gait features do not perform well when a person changes orientation with respect to the camera. To tackle this problem, we propose a new method to learn a low-dimensional, view-invariant gait feature for person identification and verification. We model a gait observed from several different viewpoints as a Gaussian distribution, and then use a Joint Bayesian term as a regularizer, coupled with the main objective function of non-negative matrix factorization, to map gait features into a low-dimensional space. This process yields an informative gait feature that can be used in a verification task. Experiments on a large gait dataset confirm the strength of the proposed method.
{"title":"View-Invariant Gait Representation Using Joint Bayesian Regularized Non-negative Matrix Factorization","authors":"M. Babaee, G. Rigoll","doi":"10.1109/ICCVW.2017.303","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.303","url":null,"abstract":"Gait as a biometric feature has been investigated for human identification and biometric application. However, gait is highly dependent on the view angle. Therefore, the proposed gait features do not perform well when a person is changing his/her orientation towards camera. To tackle this problem, we propose a new method to learn low-dimensional view-invariant gait feature for person identification/verification. We model a gait observed by several different points of view as a Gaussian distribution and then utilize a function of Joint Bayesian as a regularizer coupled with the main objective function of non-negative matrix factorization to map gait features into a low-dimensional space. This process leads to an informative gait feature that can be used in a verification task. The performed experiments on a large gait dataset confirms the strength of the proposed method.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127765858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}