A Method Based on the Indirect Approach for Counting People in Crowded Scenes
Donatello Conte, P. Foggia, G. Percannella, M. Vento. DOI: 10.1109/AVSS.2010.86

This paper presents a method for counting people in a scene by establishing a mapping between scene features and the number of people, avoiding the complex foreground detection problem. The method is based on SURF features and an ε-SVR regressor to provide an estimate of this count. The algorithm specifically takes into account problems due to partial occlusions and to perspective.
Soccer Player Activity Recognition by a Multivariate Features Integration
T. D'orazio, Marco Leo, P. Mazzeo, P. Spagnolo. DOI: 10.1109/AVSS.2010.62

Human action recognition is an important research area in the field of computer vision with a great number of real-world applications. This paper presents a multi-view action recognition framework that extracts human silhouette clues from different cameras, analyzes scene dynamics, and interprets human behaviors by integrating multivariate data in a fuzzy rule-based system. Different features are considered for player action recognition, some concerning human silhouette analysis and others related to ball and player kinematics. Experiments were carried out on multi-view image sequences from a public soccer data set.
A Spatiotemporal Motion-Vector Filter for Object Tracking on Compressed Video
Ronaldo C. Moura, E. M. Hemerly. DOI: 10.1109/AVSS.2010.82

In this paper, a novel filter for real-time object tracking in the compressed domain is presented and evaluated. The filter significantly reduces the noisy motion vectors, which do not represent real object movement, in MPEG-family compressed videos. The filter analyses the spatial (neighborhood) and temporal coherence of block motion vectors to determine whether they are likely to represent true motion in the recorded scene. Qualitative and quantitative experiments show that the proposed spatiotemporal filter (STF) outperforms the currently widely used vector median filter. The results obtained with the spatiotemporal filter make it suitable as a first step of any system that aims to detect and track objects in compressed video using its motion vectors.
Real Time Human Action Recognition in a Long Video Sequence
Ping Guo, Z. Miao, Yuan Shen, Heng-Da Cheng. DOI: 10.1109/AVSS.2010.44

In recent years, most action recognition research has focused on isolated action analysis in short videos, ignoring the issue of continuous action recognition in a long video sequence in real time. This paper proposes a novel approach for human action recognition in a video sequence of arbitrary length which, unlike previous works, requires no annotations and no pre-temporal-segmentation. Based on the bag-of-words representation and the probabilistic Latent Semantic Analysis (pLSA) model, recognition proceeds frame by frame and the decision is updated over time. Experimental results show that this approach effectively recognizes both isolated and continuous actions regardless of sequence length, which is very useful for real-time applications like video surveillance. We also test our approach on real-time temporal video segmentation and real-time key frame extraction.
A Bayesian Framework for Online Interaction Classification
S. Maludrottu, M. Beoldo, M. Alvarez, C. Regazzoni. DOI: 10.1109/AVSS.2010.56

Real-time automatic human behavior recognition is one of the most challenging tasks for intelligent surveillance systems. Its importance lies in the possibility of robustly detecting suspicious behaviors in order to prevent possible threats. The widespread integration of tracking algorithms into modern surveillance systems makes it possible to acquire descriptive motion patterns of different human activities. In this work, a statistical framework for human interaction recognition based on Dynamic Bayesian Networks (DBNs) is presented: the environment is partitioned by a topological algorithm into a set of zones that are used to define the state of the DBNs. Interactive and non-interactive behaviors are described in terms of sequences of significant motion events in the topological map of the environment. Finally, by means of an incremental classification measure, a scenario can be classified while it is still evolving. In this way an autonomous surveillance system can detect and cope with potential threats in real time.
Person Re-identification Using Spatial Covariance Regions of Human Body Parts
Sławomir Bąk, E. Corvée, F. Brémond, M. Thonnat. DOI: 10.1109/AVSS.2010.34

In many surveillance systems there is a requirement to determine whether a given person of interest has already been observed over a network of cameras. This is the person re-identification problem. The human appearance obtained in one camera is usually different from that obtained in another camera, so to re-identify people the human signature should handle differences in illumination, pose, and camera parameters. We propose a new appearance model based on spatial covariance regions extracted from human body parts, with a new spatial pyramid scheme applied to capture the correlation between body parts and obtain a discriminative human signature. The human body parts are automatically detected using Histograms of Oriented Gradients (HOG). The method is evaluated using benchmark video sequences from the i-LIDS Multiple-Camera Tracking Scenario data set, and re-identification performance is presented using the cumulative matching characteristic (CMC) curve. Finally, we show that the proposed approach outperforms state-of-the-art methods.
Human Localization in a Cluttered Space Using Multiple Cameras
Jiali Shen, Weiqi Yan, P. Miller, Huiyu Zhou. DOI: 10.1109/AVSS.2010.60

The use of single- and dual-camera approaches to locating a subject in a cluttered 3-D space is investigated. Specifically, we investigate the case where the lower portion of the body may be occluded, e.g., by a chair on a bus. Experiments were conducted involving eleven subjects moving along a pre-designated route within a cluttered space. For each time instant the position of each subject was manually estimated and compared to that produced automatically. The dual-camera approach was found to give significantly better performance than the single-camera approach: inaccurate bounding of the lowest part of the subject, due to occlusion, led to localisation errors in range as large as 10 m for the latter. Using the side bounds of the detected object, which were found to be robust, accurate azimuth estimates can be obtained from a single camera. The dual-camera approach exploits this greater accuracy in azimuth to estimate range through triangulation, giving average localisation errors of 40 cm over the space of interest.
Learning Dense Optical-Flow Trajectory Patterns for Video Object Extraction
Wang-Chou Lu, Y. Wang, Chu-Song Chen. DOI: 10.1109/AVSS.2010.79

We propose an unsupervised method to address video object extraction (VOE) in uncontrolled videos, i.e., videos captured by low-resolution and freely moving cameras. We advocate the use of dense optical-flow trajectories (DOTs), obtained by propagating the optical flow information at the pixel level, so no interest point extraction is required in our framework. To integrate color and shape information of moving objects, we group the DOTs at the super-pixel level to extract co-motion regions, and use the associated pyramid histogram of oriented gradients (PHOG) descriptors to extract objects of interest across video frames. Our approach to VOE is easy to implement, and the use of DOTs for both motion segmentation and object tracking is more robust than existing trajectory-based methods. Experiments on several video sequences demonstrate the feasibility of the proposed VOE framework.
Bringing Richer Information with Reliability to Automated Traffic Monitoring from the Fusion of Multiple Cameras, Inductive Loops and Road Maps
Kostia Robert. DOI: 10.1109/AVSS.2010.67

This paper presents a novel, deterministic framework to extract the traffic state of an intersection with high reliability and in real time. The multiple video cameras and inductive loops at the intersection are fused on a common plane consisting of a satellite map. The sensors are registered from a CAD map of the intersection that is aligned on the satellite map, and the cameras are calibrated to provide the mapping equations that project the detected vehicle positions onto the coordinate system of the satellite map. We use a night-time vehicle detection algorithm to process the camera frames. The inductive loops confirm or reject the vehicle tracks measured by the cameras, and the fusion of camera and loop provides an additional feature: the vehicle length. A Kalman filter linearly tracks the vehicles along the lanes; over time, this filter reduces the noise present in the measurements. The advantage of this approach is that the detected vehicles and their parameters acquire a very high confidence, bringing almost 100% accuracy of the traffic state. An empirical evaluation is performed on a testbed intersection, and we show the improvement of this framework over single-sensor frameworks.
Real-Time 3D Human Pose Estimation from Monocular View with Applications to Event Detection and Video Gaming
Shian-Ru Ke, Liang-Jia Zhu, Jenq-Neng Hwang, Hung-I Pai, Kung-Ming Lan, C. Liao. DOI: 10.1109/AVSS.2010.80

We present an effective real-time approach for automatically estimating 3D human body poses from monocular video sequences. In this approach, the human body is automatically detected from the video sequence; then image features such as silhouette, edges, and color are extracted and integrated to infer 3D human poses by iteratively minimizing a cost function defined between 2D features derived from the projected 3D model and those extracted from the video sequence. In addition, the 2D locations of the head, hands, and feet are tracked to facilitate 3D tracking. When tracking failure happens, the approach can detect it and recover quickly. Finally, the efficiency and robustness of the proposed approach are shown in two real applications: human event detection and video gaming.