We propose a per-frame upper body pose estimation method for sports players captured in low-resolution team sports videos. Using the head-center-aligned upper body appearance obtained from a head tracker in each frame, our framework estimates (1) the 2D spine pose, composed of the head center and pelvis center locations, and (2) the orientation of the upper body. The framework consists of three steps. In the first step, the head region of the subject player is tracked with a standard tracking-by-detection technique to align the upper body appearance. In the second step, the pelvis center location relative to the head center is estimated in each frame by our newly proposed poselet-regressor to obtain a spine angle prior. In the last step, the body orientation is estimated by the upper body orientation classifier selected according to the spine angle range. Owing to the alignment of the body appearance and the use of multiple body orientation classifiers conditioned on the spine angle prior, our method can robustly estimate the body orientation of a player whose visual appearance varies widely during a game, even for side poses or self-occluded poses. We tested the performance of our method on both American football and soccer videos.
{"title":"Upper Body Pose Estimation for Team Sports Videos Using a Poselet-Regressor of Spine Pose and Body Orientation Classifiers Conditioned by the Spine Angle Prior","authors":"Masaki Hayashi, Kyoko Oshima, Masamoto Tanabiki, Y. Aoki","doi":"10.2197/ipsjtcva.7.121","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.121","url":null,"abstract":"We propose a per-frame upper body pose estimation method for sports players captured in low-resolution team sports videos. Using the head-center-aligned upper body region appearance in each frame from the head tracker, our framework estimates (1) 2D spine pose, composed of the head center and the pelvis center locations, and (2) the orientation of the upper body in each frame. Our framework is composed of three steps. In the first step, the head region of the subject player is tracked with a standard tracking-by-detection technique for upper body appearance alignment. In the second step, the relative pelvis center location from the head center is estimated by our newly proposed poseletregressor in each frame to obtain spine angle priors. In the last step, the body orientation is estimated by the upper body orientation classifier selected by the spine angle range. Owing to the alignment of the body appearance and the usage of multiple body orientation classifiers conditioned by the spine angle prior, our method can robustly estimate the body orientation of a player with a large variation of visual appearances during a game, even during side-poses or self-occluded poses. We tested the performance of our method in both American football and soccer videos.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"37 1","pages":"121-137"},"PeriodicalIF":0.0,"publicationDate":"2015-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84868702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a new application that improves communication between digital media and customers at a point of sale. The system uses several methods from various areas of computer vision, such as motion detection, object tracking, behavior analysis and recognition, semantic description of behavior, and scenario recognition. Specifically, the system is divided into three parts: low-level, mid-level, and high-level analysis. Low-level analysis detects and tracks moving objects in the scene. Mid-level analysis then describes and recognizes the behavior of the tracked objects. Finally, high-level analysis produces a semantic interpretation of the detected behavior and recognizes predefined scenarios. Our goal is a real-time application that recognizes human behaviors while shopping; specifically, the system detects customer interest in and interactions with various products at a point of sale.
{"title":"Human Behavior Recognition in Shopping Settings","authors":"R. Sicre, H. Nicolas","doi":"10.2197/ipsjtcva.7.151","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.151","url":null,"abstract":"This paper presents a new application that improves communication between digital media and customers at a point of sale. The system uses several methods from various areas of computer vision such as motion detection, object tracking, behavior analysis and recognition, semantic description of behavior, and scenario recognition. Specifically, the system is divided in three parts: low-level, mid-level, and high-level analysis. Low-level analysis detects and tracks moving object in the scene. Then mid-level analysis describes and recognizes behavior of the tracked objects. Finally high-level analysis produces a semantic interpretation of the detected behavior and recognizes predefined scenarios. Our research is developed in order to build a real-time application that recognizes human behaviors while shopping. Specifically, the system detects customer interests and interactions with various products at a point of sale.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"5 1","pages":"151-162"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87905917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In moving camera videos, motion segmentation is often achieved by determining the motion coherence of each moving object. However, this is a nontrivial task on optical flow for two reasons: 1) the optical flow induced by camera motion in the 3D world consists of three primary 2D motion flows: translation, rotation, and radial flow, whose coherence analysis requires a variety of models and plenty of priors in existing frameworks; 2) a moving camera introduces 3D motion, and depth discontinuities cause motion discontinuities that severely break the coherence. Meanwhile, the mixture of camera motion and the motions of moving objects makes it difficult to clearly separate foreground from background. In this work, our solution is to transform the optical flow into a potential space in which the coherence of the background flow field is easily modeled by a low-order polynomial. To this end, we first amend the Helmholtz-Hodge Decomposition by adding coherence constraints, which transforms translation, rotation, and radial flow fields into two potential surfaces under a unified framework. Second, we introduce an Incoherence Map and a progressive Quad-Tree partition to reject moving objects and motion discontinuities. Finally, the low-order polynomial is fitted to the remaining flow samples on the two potentials. We present results on more than twenty videos from four benchmarks. Extensive experiments demonstrate better performance in dealing with challenging scenes with complex backgrounds. Our method improves the segmentation accuracy of state-of-the-art methods by 10%∼30%.
{"title":"A General Inlier Estimation for Moving Camera Motion Segmentation","authors":"Xuefeng Liang, Cuicui Zhang, T. Matsuyama","doi":"10.2197/ipsjtcva.7.163","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.163","url":null,"abstract":"In moving camera videos, motion segmentation is often achieved by determining the motion coherence of each moving object. However, it is a nontrivial task on optical flow due to two problems: 1) Optical flow of the camera motions in 3D world consists of three primary 2D motion flows: translation, rotation, and radial flow. Their coherence analysis is done by a variety of models, and further requires plenty of priors in existing frameworks; 2) A moving camera introduces 3D motion, the depth discontinuities cause the motion discontinuities that severely break down the coherence. Meanwhile, the mixture of the camera motion and moving objects’ motions make it difficult to clearly identify foreground and background. In this work, our solution is to transform the optical flow into a potential space where the coherence of the background flow field is easily modeled by a low order polynomial. To this end, we first amend the Helmholts-Hodge Decomposition by adding coherence constraints, which can transform translation, rotation, and radial flow fields to two potential surfaces under a unified framework. Secondly, we introduce an Incoherence Map and a progressive Quad-Tree partition to reject moving objects and motion discontinuities. Finally, the low order polynomial is achieved from the rest flow samples on two potentials. We present results on more than twenty videos from four benchmarks. Extensive experiments demonstrate better performance in dealing with challenging scenes with complex backgrounds. Our method improves the segmentation accuracy of state-of-the-arts by 10%∼30%.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"33 1","pages":"163-174"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86286320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saliency maps, as visual attention computational models, can reveal novel regions within a scene (as in the human visual system), which can decrease the amount of data to be processed in task-specific computer vision applications. Most saliency computation models give priority to spatial or object-based features to obtain bottom-up or top-down saliency maps and do not take advantage of prior spatial memory. In our previous experiments, we demonstrated that spatial memory, regardless of object features, can aid detection and tracking tasks with a mobile robot by using a 2D global environment memory of the robot and local Kinect data in 2D to compute a space-based saliency map. However, in complex scenes where 2D space-based saliency is not enough (e.g., a subject lying on a bed), 3D scene analysis is necessary to extract novelty within the scene using spatial memory. Therefore, in this work, to improve the detection of novelty in a known environment, we propose a space-based spatial saliency with 3D local information, improving the 2D space-based saliency with height as prior information about specific locations. Moreover, the algorithm can be integrated with other bottom-up or top-down saliency computational models to improve their detection results. Experimental results demonstrate that high accuracy for novelty detection can be obtained and that computational time can be reduced for existing state-of-the-art detection and tracking models with the proposed algorithm.
{"title":"Spatial Visual Attention for Novelty Detection: A Space-based Saliency Model in 3D Using Spatial Memory","authors":"Nevrez Imamoglu, E. Dorronzoro, M. Sekine, K. Kita, Wenwei Yu","doi":"10.2197/ipsjtcva.7.35","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.35","url":null,"abstract":"Saliency maps as visual attention computational models can reveal novel regions within a scene (as in the human visual system), which can decrease the amount of data to be processed in task specific computer vision applications. Most of the saliency computation models do not take advantage of prior spatial memory by giving priority to spatial or object based features to obtain bottom-up or top-down saliency maps. In our previous experiments, we demonstrated that spatial memory regardless of object features can aid detection and tracking tasks with a mobile robot by using a 2D global environment memory of the robot and local Kinect data in 2D to compute the space-based saliency map. However, in complex scenes where 2D space-based saliency is not enough (i.e., subject lying on the bed), 3D scene analysis is necessary to extract novelty within the scene by using spatial memory. Therefore, in this work, to improve the detection of novelty in a known environment, we proposed a space-based spatial saliency with 3D local information by improving 2D space base saliency with height as prior information about the specific locations. Moreover, the algorithm can also be integrated with other bottom-up or top-down saliency computational models to improve the detection results. Experimental results demonstrate that high accuracy for novelty detection can be obtained, and computational time can be reduced for existing state of the art detection and tracking models with the proposed algorithm.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"158 1","pages":"35-40"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88859146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose an audio-visual speech recognition system for a person with an articulation disorder resulting from severe hearing loss. For a person with this type of articulation disorder, the speech style is so different from that of people without hearing loss that a speaker-independent model trained on unimpaired speakers is of little use for recognition. We investigate an audio-visual speech recognition system for a person with severe hearing loss in noisy environments, where a robust feature extraction method using a convolutive bottleneck network (CBN) is applied to the audio-visual data. We confirmed the effectiveness of this approach through word-recognition experiments in noisy environments, where the CBN-based feature extraction method outperformed the conventional methods.
{"title":"Audio-Visual Speech Recognition Using Convolutive Bottleneck Networks for a Person with Severe Hearing Loss","authors":"Yuki Takashima, Yasuhiro Kakihara, Ryo Aihara, T. Takiguchi, Y. Ariki, Nobuyuki Mitani, K. Omori, Kaoru Nakazono","doi":"10.2197/ipsjtcva.7.64","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.64","url":null,"abstract":"In this paper, we propose an audio-visual speech recognition system for a person with an articulation disorder resulting from severe hearing loss. In the case of a person with this type of articulation disorder, the speech style is quite different from with the result that of people without hearing loss that a speaker-independent model for unimpaired persons is hardly useful for recognizing it. We investigate in this paper an audio-visual speech recognition system for a person with severe hearing loss in noisy environments, where a robust feature extraction method using a convolutive bottleneck network (CBN) is applied to audio-visual data. We confirmed the effectiveness of this approach through word-recognition experiments in noisy environments, where the CBN-based feature extraction method outperformed the conventional methods.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"26 1","pages":"64-68"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80606487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a mobile Lidar system for efficiently and accurately capturing the 3D shape of the bas-reliefs in Angkor Wat. The sensor system consists of two main components: 1) a panoramic camera and 2) a 2D 360-degree laser line scanner, which moves slowly on rails parallel to the reliefs. We first propose a new but simple method to accurately calibrate the panoramic camera to the 2D laser scan lines. The sensor motion can then be estimated from the sensor-fused system using a 2D/3D feature tracking method. Furthermore, to reduce the drift error of the sensor motion, we adopt bundle adjustment to globally optimize and smooth the moving trajectories. In experiments, we demonstrate that our moving Lidar system achieves substantially better accuracy and efficiency than traditional stop-and-go methods.
{"title":"Rail Sensor: A Mobile Lidar System for 3D Archiving the Bas-reliefs in Angkor Wat","authors":"Bo Zheng, Takeshi Oishi, K. Ikeuchi","doi":"10.2197/ipsjtcva.7.59","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.59","url":null,"abstract":"This paper presents a mobile Lidar system for efficiently and accurately capturing the 3D shape of the Bas-reliefs in Angkor Wat. The sensor system consists of two main components: 1) a panoramic camera and 2) a 2D 360-degree laser line scanner, which moves slowly on the rails parallel to the reliefs. In this paper, we first propose a new but simple method to accurately calibrate the panoramic camera to the 2D laser scan lines. Then the sensor motion can be estimated from the sensor-fused system using the 2D/3D features tracking method. Furthermore, to reduce the drifting error of sensor motion we adopt bundle adjustment to globally optimize and smooth the moving trajectories. In experiments, we demonstrate that our moving Lidar system achieves substantially better performance for accuracy and efficiency in comparison to the traditional stop-and-go methods.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"15 1","pages":"59-63"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90395149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a new image denoising method with shrinkage. In the proposed method, small blocks of an input image are projected onto a space that makes the projection coefficients sparse, and the explicitly evaluated sparsity degree is used to control the shrinkage threshold. On average, the proposed method obtained higher quantitative evaluation values (PSNR and SSIM) than one of the state-of-the-art methods in the field of image denoising. The proposed method removes random noise effectively from natural images while preserving intricate textures.
{"title":"Image Denoising with Sparsity Distillation","authors":"S. Kawata, Nao Mishima","doi":"10.2197/ipsjtcva.7.50","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.50","url":null,"abstract":"We propose a new image denoising method with shrinkage. In the proposed method, small blocks in an input image are projected to the space that makes projection coefficients sparse, and the explicitly evaluated sparsity degree is used to control the shrinkage threshold. On average, the proposed method obtained higher quantitative evaluation values (PSNRs and SSIMs) compared with one of the state-of-the-art methods in the field of image denoising. The proposed method removes random noise effectively from natural images while preserving intricate textures.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"40 1","pages":"50-54"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75511738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the realm of multi-modal visual recognition, the reliability of the data acquisition system is often a concern due to the increased complexity of the sensors. One of the major issues is the accidental loss of one or more sensing channels, which poses a major challenge to current learning systems. In this paper, we examine one such missing-data problem, in which a main modality/view and an auxiliary modality/view are both present in the training data, but only the main modality/view is available in the test data. To effectively leverage the auxiliary information to train a stronger classifier, we propose a collaborative auxiliary learning framework based on a new discriminative canonical correlation analysis. This framework reveals a common semantic space shared across both modalities/views by enforcing a series of nonlinear projections. Such projections automatically embed the discriminative cues hidden in both modalities/views into the common space, and better visual recognition is thus achieved on the test data. The efficacy of our proposed auxiliary learning approach is demonstrated through four challenging visual recognition tasks with different kinds of auxiliary information.
{"title":"Auxiliary Training Information Assisted Visual Recognition","authors":"Qilin Zhang, G. Hua, W. Liu, Zicheng Liu, Zhengyou Zhang","doi":"10.2197/ipsjtcva.7.138","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.138","url":null,"abstract":"In the realm of multi-modal visual recognition, the reliability of the data acquisition system is often a concern due to the increased complexity of the sensors. One of the major issues is the accidental loss of one or more sensing channels, which poses a major challenge to current learning systems. In this paper, we examine one of these specific missing data problems, where we have a main modality/view along with an auxiliary modality/view present in the training data, but merely the main modality/view in the test data. To effectively leverage the auxiliary information to train a stronger classifier, we propose a collaborative auxiliary learning framework based on a new discriminative canonical correlation analysis. This framework reveals a common semantic space shared across both modalities/views through enforcing a series of nonlinear projections. Such projections automatically embed the discriminative cues hidden in both modalities/views into the common space, and better visual recognition is thus achieved on the test data. The efficacy of our proposed auxiliary learning approach is demonstrated through four challenging visual recognition tasks with different kinds of auxiliary information.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"75 1","pages":"138-150"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79495291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper investigates the performance of silhouette-based and depth-based gait authentication under practical sensor settings, where sensors are installed in an environment afterwards and usually have to be placed quite close to people. To enable a fair comparison between different sensors and methods, we construct full-body volumes of walking people in a multi-camera environment so that virtual silhouette and depth images can be reconstructed at arbitrary sensor positions. In addition, we investigate performance when authentication must be carried out between frontal and rear views. Experimental results confirm that the depth-based methods outperform the silhouette-based ones in these realistic situations. We also confirm that, by introducing the Depth-based Gait Feature, we can authenticate between frontal and rear views.
{"title":"Depth-based Gait Authentication for Practical Sensor Settings","authors":"Taro Ikeda, Ikuhisa Mitsugami, Y. Yagi","doi":"10.2197/ipsjtcva.7.94","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.94","url":null,"abstract":"This paper investigates performances of silhouette-based and depth-based gait authentication considering practical sensor settings where sensors are located in an environments afterwards and usually have to be located quite near to people. To realize fair comparison between different sensors and methods, we construct full-body volume of walking people by a multi-camera environment so as to reconstruct virtual silhouette and depth images at arbitrary sensor positions. In addition, we also investigate performances when we have to authenticate between frontal and rear views. Experimental results confirm that the depth-based methods outperform the silhouette-based ones in the realistic situations. We also confirm that by introducing Depth-based Gait Feature, we can authenticate between the frontal and rear views.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"17 1","pages":"94-98"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86093642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most gait recognition approaches rely on silhouette-based representations due to their high recognition accuracy and computational efficiency, and a key problem for those approaches is how to accurately extract individuality-preserving silhouettes from real scenes, where foreground colors may be similar to background colors and the background is cluttered. We therefore propose a method of individuality-preserving silhouette extraction for gait recognition using standard gait models (SGMs), composed of clean silhouette sequences of a variety of training subjects, as a shape prior. We first match the multiple SGMs to a background subtraction sequence of a test subject by dynamic programming and select the training subject whose SGM fits the test sequence best. We then formulate our silhouette extraction problem in a well-established graph-cut segmentation framework while considering a balance between the observed test sequence and the matched SGM. More specifically, we define an energy function to be minimized with the following three terms: (1) a data term derived from the observed test sequence, (2) a smoothness term derived from spatio-temporally adjacent edges, and (3) a shape-prior term derived from the matched SGM. We demonstrate that the proposed method successfully extracts individuality-preserving silhouettes and improves gait recognition accuracy through experiments using 56 subjects.
{"title":"Individuality-preserving Silhouette Extraction for Gait Recognition","authors":"Yasushi Makihara, Takuya Tanoue, D. Muramatsu, Y. Yagi, Syunsuke Mori, Yuzuko Utsumi, M. Iwamura, K. Kise","doi":"10.2197/ipsjtcva.7.74","DOIUrl":"https://doi.org/10.2197/ipsjtcva.7.74","url":null,"abstract":"Most gait recognition approaches rely on silhouette-based representations due to high recognition accu- racy and computational efficiency, and a key problem for those approaches is how to accurately extract individuality- preserved silhouettes from real scenes, where foreground colors may be similar to background colors and the back- groundis cluttered. We thereforeproposea method of individuality-preservingsilhouetteextractionfor gait recognition using standard gait models (SGMs) composed of clean silhouette sequences of a variety of training subjects as a shape prior. We firstly match the multiple SGMs to a background subtraction sequence of a test subject by dynamic pro- gramming and select the training subject whose SGM fit the test sequence the best. We then formulate our silhouette extraction problem in a well-established graph-cut segmentation framework while considering a balance between the observed test sequence and the matched SGM. More specifically, we define an energy function to be minimized by the following three terms: (1) a data term derived from the observed test sequence, (2) a smoothness term derived from spatio-temporally adjacent edges, and (3) a shape-prior term derived from the matched SGM. We demonstrate that the proposed method successfully extracts individuality-preserved silhouettes and improved gait recognition accuracy through experiments using 56 subjects.","PeriodicalId":38957,"journal":{"name":"IPSJ Transactions on Computer Vision and Applications","volume":"20 1","pages":"74-78"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88202496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}