Active stereo vision for improving long range hearing using a Laser Doppler Vibrometer
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711554
Tao Wang, Rui Li, Zhigang Zhu, Yufu Qu
Laser Doppler Vibrometers (LDVs) have been widely applied to vibration detection in fields such as mechanics, bridge inspection, and biometrics, as well as long-range surveillance, where acoustic signatures can be obtained at a large distance. However, in both industrial and scientific applications, LDVs are controlled manually for surface selection, laser focusing, and acoustic acquisition. In this paper, we propose an active stereo vision approach to enable fast, automated laser pointing and tracking for long-range LDV hearing. The system consists of: 1) a mirror on a Pan-Tilt-Unit (PTU) that redirects the laser beam to any location quickly, and 2) two Pan-Tilt-Zoom (PTZ) cameras, one of which is mounted on the PTU and kept aligned with the laser beam. Distance measurement from the stereo vision system, together with triangulation between the camera and the LDV laser beam, allows us to rapidly focus the laser on selected surfaces and obtain acoustic signals at ranges of up to 200 meters in real time. We present promising results with collaborative visual and LDV measurements for laser pointing and focusing toward long-range audio detection.
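The ranging step described here reduces to standard stereo triangulation: target depth follows from the disparity between the two camera views and the rig's calibration. A minimal sketch of that relation (not the authors' implementation; the focal length, baseline, and disparity values below are illustrative):

```python
# Sketch: depth from stereo disparity, which can then drive the LDV focus
# distance. Parameters are illustrative, not the paper's calibration.
import numpy as np

def stereo_depth(focal_px: float, baseline_m: float, disparity_px: np.ndarray) -> np.ndarray:
    """Classic pinhole-stereo relation: Z = f * B / d."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(disparity_px, np.inf)   # zero disparity -> point at infinity
    valid = disparity_px > 0
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

if __name__ == "__main__":
    # A target 200 m away seen by a 0.5 m baseline rig with a 4000 px focal
    # length produces a disparity of about 10 px.
    print(stereo_depth(4000.0, 0.5, np.array([10.0, 20.0, 0.0])))
```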
{"title":"Active stereo vision for improving long range hearing using a Laser Doppler Vibrometer","authors":"Tao Wang, Rui Li, Zhigang Zhu, Yufu Qu","doi":"10.1109/WACV.2011.5711554","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711554","url":null,"abstract":"Laser Doppler Vibrometers (LDVs) have been widely applied for detecting vibrations in applications such as mechanics, bridge inspection, biometrics, as well as long-range surveillance in which acoustic signatures can be obtained at a large distance. However, in both industrial and scientific applications, the LDVs are manually controlled in surface selection, laser focusing, and acoustic acquisition. In this paper, we propose an active stereo vision approach to facilitate fast and automated laser pointing and tracking for long-range LDV hearing. The system contains: 1) a mirror on a Pan-Tilt-Unit (PTU) to reflect the laser beam to any locations freely and quickly, and 2) two Pan-Tilt-Zoom (PTZ) cameras, one of which is mounted on the Pan-Tilt-Unit (PTU) and aligned with the laser beam synchronously. The distance measurement using the stereo vision system as well as triangulation between camera and the LDV laser beam allow us to fast focus the laser beam on selected surfaces and to obtain acoustic signals up to 200 meters in real time. We present some promising results with the collaborative visual and LDV measurements for laser pointing and focusing in order to achieve long range audio detection.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131737338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploratory analysis of time-lapse imagery with fast subset PCA
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711523
Austin Abrams, Emily Feder, Robert Pless
In surveillance and environmental monitoring applications, it is common to have millions of images of a particular scene. While there exist tools to find particular events, anomalies, and human actions and behaviors, there has been little investigation of tools that allow more exploratory searches in the data. This paper proposes modifications to PCA that enable users to quickly recompute low-rank decompositions for selected spatial and temporal subsets of the data. This process returns decompositions orders of magnitude faster than general PCA, and the results are close to optimal in terms of reconstruction error. We show examples of real exploratory data analysis across several applications, including an interactive web application.
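The operation the paper accelerates is, conceptually, recomputing a low-rank decomposition over a user-selected spatial and temporal subset of the image stack. A minimal sketch of that baseline computation (a plain truncated SVD, not the authors' fast subset method; array shapes are illustrative):

```python
# Sketch: straightforward low-rank decomposition of a spatio-temporal subset
# of a time-lapse stack. This is the naive recomputation, not the fast method.
import numpy as np

def subset_pca(frames: np.ndarray, pixel_mask: np.ndarray,
               frame_idx: np.ndarray, rank: int):
    """frames: (T, H, W) stack; pixel_mask: (H, W) bool; frame_idx: selected frames."""
    X = frames[frame_idx][:, pixel_mask]          # (t, p) data matrix for the subset
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:rank]                        # principal directions
    coeffs = (X - mean) @ components.T            # per-frame low-rank coefficients
    recon_err = np.linalg.norm((X - mean) - coeffs @ components)
    return mean, components, coeffs, recon_err

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.random((50, 32, 32))
    mask = np.zeros((32, 32), dtype=bool)
    mask[8:24, 8:24] = True                       # spatial subset
    print(subset_pca(frames, mask, np.arange(0, 50, 2), rank=5)[3])
```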
{"title":"Exploratory analysis of time-lapse imagery with fast subset PCA","authors":"Austin Abrams, Emily Feder, Robert Pless","doi":"10.1109/WACV.2011.5711523","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711523","url":null,"abstract":"In surveillance and environmental monitoring applications, it is common to have millions of images of a particular scene. While there exist tools to find particular events, anomalies, human actions and behaviors, there has been little investigation of tools which allow more exploratory searches in the data. This paper proposes modifications to PCA that enable users to quickly recompute low-rank decompositions for select spatial and temporal subsets of the data. This process returns decompositions orders of magnitude faster than general PCA and are close to optimal in terms of reconstruction error. We show examples of real exploratory data analysis across several applications, including an interactive web application.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134173100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Co-training framework of generative and discriminative trackers with partial occlusion handling
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711565
T. Dinh, G. Medioni
Partial occlusion is a challenging problem in object tracking. In online visual tracking, it is the critical factor causing drift. To address this problem, we propose a novel approach using a co-training framework of generative and discriminative trackers. Our approach is able to detect the occluding region and continuously update both the generative and discriminative models using information from the non-occluded part. The generative model encodes all of the appearance variations using a low-dimensional subspace, which provides a strong reacquisition ability. Meanwhile, the discriminative classifier, an online support vector machine, focuses on separating the object from the background using a Histograms of Oriented Gradients (HOG) feature set. For each search window, an occlusion likelihood map is generated by the two trackers through a co-decision process. If the two trackers disagree, the movement vote of KLT local features is used as a referee. Precise occlusion segmentation is performed using MeanShift. Finally, each tracker recovers the occluded part and updates its own model using the new non-occluded information. Experimental results on challenging sequences with different types of objects are presented. We also compare with other state-of-the-art methods to demonstrate the superiority and robustness of our tracking framework.
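The co-decision step can be pictured as fusing two per-pixel cues: how poorly the generative subspace reconstructs a pixel and how weakly the discriminative classifier scores it as object. The sketch below illustrates that idea only; the fusion rule, weights, and thresholds are placeholders rather than the authors' exact procedure, and the KLT-based referee is omitted:

```python
# Sketch: a toy occlusion-likelihood map combining a generative reconstruction
# error with a discriminative object score. Fusion weights are illustrative.
import numpy as np

def occlusion_likelihood(recon_error: np.ndarray, obj_score: np.ndarray,
                         err_scale: float = 0.1) -> np.ndarray:
    """recon_error, obj_score: per-pixel maps over the search window, in [0, inf) and [0, 1]."""
    gen_vote = 1.0 - np.exp(-recon_error / err_scale)   # high error -> likely occluded
    dis_vote = 1.0 - obj_score                            # low object score -> likely occluded
    return 0.5 * (gen_vote + dis_vote)

if __name__ == "__main__":
    err = np.array([[0.01, 0.5], [0.02, 0.8]])
    score = np.array([[0.9, 0.2], [0.8, 0.1]])
    print(occlusion_likelihood(err, score))              # right column flagged as occluded
```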
{"title":"Co-training framework of generative and discriminative trackers with partial occlusion handling","authors":"T. Dinh, G. Medioni","doi":"10.1109/WACV.2011.5711565","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711565","url":null,"abstract":"Partial occlusion is a challenging problem in object tracking. In online visual tracking, it is the critical factor causing drift. To address this problem, we propose a novel approach using a co-training framework of generative and discriminative trackers. Our approach is able to detect the occluding region and continuously update both the generative and discriminative models using the information from the non-occluded part. The generative model encodes all of the appearance variations using a low dimension subspace, which helps provide a strong reacquisition ability. Meanwhile, the discriminative classifer, an online support vector machine, focuses on separating the object from the background using a Histograms of Oriented Gradients (HOG) feature set. For each search window, an occlusion likelihood map is generated by the two trackers through a co-decision process. If there is disagreement between these two trackers, the movement vote of KLT local features is used as a referee. Precise occlusion segmentation is performed using MeanShift. Finally, each tracker recovers the occluded part and updates its own model using the new non-occluded information. Experimental results on challenging sequences with different types of objects are presented. We also compare with other state-of-the-art methods to demonstrate the superiority and robustness of our tracking framework.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131025792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PIRF-Nav 2: Speeded-up online and incremental appearance-based SLAM in an indoor environment
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711496
Noppharit Tongprasit, Aram Kawewong, O. Hasegawa
This paper presents a fast, online, and incremental solution to the appearance-based loop closure detection problem in an indoor environment. This problem is important for the navigation of mobile robots. Appearance-based Simultaneous Localization And Mapping (SLAM) for a highly dynamic environment, called Position Invariant Robust Feature Navigation (PIRF-Nav), was first proposed by Kawewong et al. in 2010. Their results showed major improvements over other state-of-the-art methods. However, the computational expense of PIRF-Nav is too high for real-time operation, and it consumes a tremendous amount of memory. These two factors hinder the use of PIRF-Nav for mobile robot applications. This study proposes (i) a modified PIRF extraction that makes the system more suitable for an indoor environment and (ii) a new dictionary management scheme that eliminates redundant searching and reduces memory consumption. According to the results, our proposed method can finish tasks up to 12 times faster than PIRF-Nav, with only a slight decline in recall, while precision remains 1. In addition, for a more challenging task, we collected additional data from a crowded university canteen during lunch time. Even in this cluttered environment, our proposed method performs better than other methods while processing in real time.
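Dictionary management matters here because appearance-based loop closure is typically served by an inverted index from visual words to previously seen places, so the dictionary's size directly bounds both lookup time and memory. A generic bag-of-words sketch of such a lookup (not the PIRF-Nav 2 dictionary scheme; word ids and place names are illustrative):

```python
# Sketch: inverted-index loop-closure lookup. Only places sharing at least one
# visual word with the query are scored, which is why pruning redundant words
# bounds both search time and memory.
from collections import defaultdict

class LoopClosureIndex:
    def __init__(self):
        self.inverted = defaultdict(set)   # word id -> set of place ids
        self.place_words = {}              # place id -> set of word ids

    def add_place(self, place_id, words):
        self.place_words[place_id] = set(words)
        for w in words:
            self.inverted[w].add(place_id)

    def query(self, words):
        """Return candidate places ranked by the number of shared visual words."""
        votes = defaultdict(int)
        for w in words:
            for p in self.inverted.get(w, ()):
                votes[p] += 1
        return sorted(votes.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    idx = LoopClosureIndex()
    idx.add_place("corridor-1", [3, 17, 42, 99])
    idx.add_place("canteen", [5, 17, 63])
    print(idx.query([17, 42, 100]))   # corridor-1 shares 2 words, canteen shares 1
```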
{"title":"PIRF-Nav 2: Speeded-up online and incremental appearance-based SLAM in an indoor environment","authors":"Noppharit Tongprasit, Aram Kawewong, O. Hasegawa","doi":"10.1109/WACV.2011.5711496","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711496","url":null,"abstract":"This paper presents a fast, online, and incremental solution for an appearance-based loop closure detection problem in an indoor environment. This problem is important in terms of the navigation of mobile robots. Appearance-based Simultaneous Localization And Mapping (SLAM) for a highly dynamic environment, called Position Invariant Robust Feature Navigation (PIRF-Nav), was first proposed by Kawewong et al. in 2010. Their results showed major improvements from other state-of-the-art methods. However, the computational expense of PIRF-Nav is beyond real time, and it consumes a tremendous amount of memory. These two factors hinder the use of PIRF-Nav for mobile robot applications. This study proposed (i) modified PIRF extraction that makes the system more suitable for an indoor environment and (ii) new dictionary management that can eliminate redundant searching and conserve memory consumption. According to the results, our proposed method can finish tasks up to 12 times faster than PIRF-Nav, with only slight percentage decline in a recall, while the precision remains 1. In addition, for a more challenging task, we collected additional data from a crowded university canteen during lunch time. Even in this cluttered environment, our proposed method performs better with real-time processing compared with other methods.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124486807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal summarization of key events and top players in sports tournament videos
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711541
D. Tjondronegoro, Xiaohui Tao, Johannes Sasongko, C. Lau
To detect and annotate the key events of live sports videos, we need to tackle the semantic gaps in audio-visual information. Previous work has successfully extracted semantics from time-stamped web match reports, which are synchronized with the video contents. However, web and social media articles without time-stamps have not been fully leveraged, even though they are increasingly used to complement the coverage of major sporting tournaments. This paper aims to address this limitation using a novel multimodal summarization framework based on sentiment analysis and players' popularity. It uses audiovisual contents, web articles, blogs, and commentators' speech to automatically annotate and visualize the key events and key players in sports tournament coverage. The experimental results demonstrate that the automatically generated video summaries are aligned with the events identified from the official website match reports.
{"title":"Multi-modal summarization of key events and top players in sports tournament videos","authors":"D. Tjondronegoro, Xiaohui Tao, Johannes Sasongko, C. Lau","doi":"10.1109/WACV.2011.5711541","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711541","url":null,"abstract":"To detect and annotate the key events of live sports videos, we need to tackle the semantic gaps of audio-visual information. Previous work has successfully extracted semantic from the time-stamped web match reports, which are synchronized with the video contents. However, web and social media articles with no time-stamps have not been fully leveraged, despite they are increasingly used to complement the coverage of major sporting tournaments. This paper aims to address this limitation using a novel multimodal summarization framework that is based on sentiment analysis and players' popularity. It uses audiovisual contents, web articles, blogs, and commentators' speech to automatically annotate and visualize the key events and key players in a sports tournament coverage. The experimental results demonstrate that the automatically generated video summaries are aligned with the events identified from the official website match reports.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121933451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experimental evidence of a template aging effect in iris biometrics
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711508
S. Fenker, K. Bowyer
It has been widely accepted that iris biometric systems are not subject to a template aging effect. Baker et al. [1] recently presented the first published evidence of a template aging effect, using images acquired from 2004 through 2008 with an LG 2200 iris imaging system, representing a total of 13 subjects (26 irises). We report on a template aging study involving two different iris recognition algorithms, a larger number of subjects (43), a more modern imaging system (LG 4000), and a shorter time lapse (2 years). We also investigate the degree to which the template aging effect may be related to pupil dilation and/or contact lenses. We find evidence of a template aging effect, resulting in an increase in match Hamming distance and false reject rate.
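The match score referred to here is, in standard iris matching, the fractional Hamming distance between two binary iris codes, counted only over bits that both occlusion masks mark as valid; the reported aging effect is a rise in this distance for matches across the time lapse. A minimal sketch under those standard assumptions (the code length and masks are illustrative):

```python
# Sketch: fractional Hamming distance between two masked binary iris codes.
import numpy as np

def fractional_hamming(code_a, mask_a, code_b, mask_b):
    """All inputs are boolean arrays of equal shape; masks flag usable bits."""
    valid = mask_a & mask_b
    n_valid = np.count_nonzero(valid)
    if n_valid == 0:
        return 1.0                                   # nothing comparable; treat as non-match
    disagree = np.count_nonzero((code_a ^ code_b) & valid)
    return disagree / n_valid

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.integers(0, 2, 2048, dtype=np.uint8).astype(bool)
    b = a.copy()
    b[:200] ^= True                                  # simulate bits that changed over time
    m = np.ones(2048, dtype=bool)
    print(fractional_hamming(a, m, b, m))            # about 200 / 2048
```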
{"title":"Experimental evidence of a template aging effect in iris biometrics","authors":"S. Fenker, K. Bowyer","doi":"10.1109/WACV.2011.5711508","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711508","url":null,"abstract":"It has been widely accepted that iris biometric systems are not subject to a template aging effect. Baker et al. [1] recently presented the first published evidence of a template aging effect, using images acquired from 2004 through 2008 with an LG 2200 iris imaging system, representing a total of 13 subjects (26 irises). We report on a template aging study involving two different iris recognition algorithms, a larger number of subjects (43), a more modern imaging system (LG 4000), and over a shorter time-lapse (2 years). We also investigate the degree to which the template aging effect may be related to pupil dilation and/or contact lenses. We find evidence of a template aging effect, resulting in an increase in match hamming distance and false reject rate.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132188611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust modified L2 local optical flow estimation and feature tracking
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711571
T. Senst, Volker Eiselein, Rubén Heras Evangelio, T. Sikora
This paper describes a robust method for local optical flow estimation and KLT feature tracking performed on the GPU. To this end, we present an estimator based on the L2 norm with robust characteristics. In order to increase the robustness at discontinuities, we propose a strategy to adapt the region size used. The GPU implementation of our approach achieves real-time (>25 fps) performance on High Definition (HD) video sequences while tracking several thousand points. The benefit of the suggested enhancement is illustrated on the Middlebury optical flow benchmark.
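The starting point for such an estimator is the plain local least-squares (Lucas-Kanade) flow solve within a support window; the paper's contribution adds robust characteristics and an adaptive region size, which the sketch below omits. A minimal illustration of the baseline solve (the gradient samples are synthetic):

```python
# Sketch: non-robust local least-squares optical flow for a single point,
# solving G v = b with G the 2x2 structure tensor of the support window.
import numpy as np

def lucas_kanade_point(Ix, Iy, It):
    """Ix, Iy, It: spatial and temporal gradients sampled inside one support window."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)     # N x 2
    b = -It.ravel()                                    # N
    G = A.T @ A                                        # 2 x 2 structure tensor
    if np.linalg.cond(G) > 1e6:                        # poorly conditioned: point untrackable
        return None
    return np.linalg.solve(G, A.T @ b)                 # flow vector (u, v)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    Ix, Iy = rng.normal(size=(2, 49))                  # 7x7 window, flattened
    true_uv = np.array([0.5, -0.25])
    It = -(Ix * true_uv[0] + Iy * true_uv[1])          # brightness-constancy residuals
    print(lucas_kanade_point(Ix, Iy, It))              # recovers approximately [0.5, -0.25]
```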
{"title":"Robust modified L2 local optical flow estimation and feature tracking","authors":"T. Senst, Volker Eiselein, Rubén Heras Evangelio, T. Sikora","doi":"10.1109/WACV.2011.5711571","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711571","url":null,"abstract":"This paper describes a robust method for the local optical flow estimation and the KLT feature tracking performed on the GPU. Therefore we present an estimator based on the L2 norm with robust characteristics. In order to increase the robustness at discontinuities we propose a strategy to adapt the used region size. The GPU implementation of our approach achieves real-time (>25 fps) performance for High Definition (HD) video sequences while tracking several thousands of points. The benefit of the suggested enhancement is illustrated on the Middlebury optical flow benchmark.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127598262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An overview of automatic event detection in soccer matches
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711480
Samuel Felix de Sousa, A. Araújo, D. Menotti
Sports video analysis has received special attention from researchers due to its high popularity and the general interest in semantic analysis. Soccer videos in particular represent an interesting field for research, allowing many types of applications: indexing, summarization, recognition of players' behavior, and so forth. Many approaches have been applied to field extraction and recognition, arc and goalmouth detection, ball and player tracking, and high-level techniques such as team tactics detection and the definition of soccer models. In this paper, we provide a hierarchy and classify approaches into it based on their analysis level, i.e., low, middle, or high. An overview of soccer event identification is presented, and we discuss general issues related to it in order to provide relevant information about what has been done in soccer video processing.
{"title":"An overview of automatic event detection in soccer matches","authors":"Samuel Felix de Sousa, A. Araújo, D. Menotti","doi":"10.1109/WACV.2011.5711480","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711480","url":null,"abstract":"Sports video analysis has received special attention from researchers due to its high popularity and general interest on semantic analysis. Hence, soccer videos represent an interesting field for research allowing many types of applications: indexing, summarization, players' behavior recognition and so forth. Many approaches have been applied for field extraction and recognition, arc and goalmouth detection, ball and players tracking, and high level techniques such as team tactics detection and soccer models definition. In this paper, we provide an hierarchy and we classify approaches into this hierarchy based on their analysis level, i.e., low, middle, and high levels. An overview of soccer event identification is presented and we discuss general issues related to it in order to provide relevant information about what has been done on soccer video processing.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116994108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting people carrying objects based on an optical flow motion model
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711518
T. Senst, Rubén Heras Evangelio, T. Sikora
Detecting people carrying objects is a commonly formulated problem, serving as a first step toward monitoring interactions between people and objects. Recent work relies on a precise foreground object segmentation, which is often difficult to achieve in video surveillance sequences due to poor contrast between foreground objects and the scene background, abruptly changing lighting conditions, and small camera vibrations. In order to cope with these difficulties, we propose an approach based on motion statistics. We use a Gaussian mixture motion model (GMMM) and, based on that model, define a novel speed- and direction-independent motion descriptor in order to detect carried baggage as those regions that do not fit the motion description model of an average walking person. The system was tested on the public PETS2006 dataset and on a more challenging dataset including abrupt lighting changes and poor color contrast, and was compared with existing systems, showing very promising results.
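The underlying idea can be illustrated by fitting a Gaussian mixture to flow vectors typical of an unburdened walking person and flagging flow with low likelihood under that model as a candidate carried-object region. The sketch below uses raw (u, v) flow rather than the speed- and direction-independent descriptor proposed in the paper, and the training samples and threshold are synthetic:

```python
# Sketch: Gaussian mixture over "normal walking" flow vectors; low-likelihood
# flow is flagged as not fitting the walking-person motion model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
walking_flow = rng.normal(loc=[1.0, 0.0], scale=0.15, size=(500, 2))   # synthetic training flow
gmm = GaussianMixture(n_components=3, random_state=0).fit(walking_flow)

test_flow = np.array([[1.05, 0.02],    # consistent with walking
                      [0.10, 0.90]])   # inconsistent, e.g. a swinging bag
log_lik = gmm.score_samples(test_flow)
threshold = gmm.score_samples(walking_flow).mean() - 3.0               # illustrative threshold
print(log_lik, log_lik < threshold)    # second sample flagged as anomalous motion
```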
{"title":"Detecting people carrying objects based on an optical flow motion model","authors":"T. Senst, Rubén Heras Evangelio, T. Sikora","doi":"10.1109/WACV.2011.5711518","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711518","url":null,"abstract":"Detecting people carrying objects is a commonly formulated problem as a first step to monitor interactions between people and objects. Recent work relies on a precise foreground object segmentation, which is often difficult to achieve in video surveillance sequences due to a bad contrast of the foreground objects with the scene background, abrupt changing light conditions and small camera vibrations. In order to cope with these difficulties we propose an approach based on motion statistics. Therefore we use a Gaussian mixture motion model (GMMM) and, based on that model, we define a novel speed and direction independent motion descriptor in order to detect carried baggage as those regions not fitting in the motion description model of an average walking person. The system was tested with the public dataset PETS2006 and a more challenging dataset including abrupt lighting changes and bad color contrast and compared with existing systems, showing very promissing results.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126534092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Segmenting color images into surface patches by exploiting sparse depth data
Pub Date: 2011-01-05 · DOI: 10.1109/WACV.2011.5711558
B. Dellen, G. Alenyà, S. Foix, C. Torras
We present a new method for segmenting color images into their composite surfaces by combining color segmentation with model-based fitting that exploits sparse depth data acquired using time-of-flight (Swissranger, PMD CamCube) and stereo techniques. The main target of our work is the segmentation of plant structures, i.e., leaves, from color-depth images, and the extraction of color and 3D shape information for automating manipulation tasks. Since segmentation is performed in the dense color space, even sparse, incomplete, or noisy depth information can be used. This kind of data often represents a major challenge for methods operating directly in the 3D data space. To achieve our goal, we construct a three-stage segmentation hierarchy by segmenting the color image at different resolutions, assuming that “true” surface boundaries must appear at some point along the segmentation hierarchy. 3D surfaces are then fitted to the color-segment areas using the depth data. The segments that minimize the fitting error are selected and used to construct a new segmentation. Then, additional region-merging and growing stages are applied to avoid over-segmentation and to label previously unclustered points. Experimental results demonstrate that the method is successful in segmenting a variety of domestic objects and plants into quadratic surfaces. At the end of the procedure, the sparse depth data is completed using the extracted surface models, resulting in dense depth maps. For stereo, the resulting disparity maps are compared with ground truth and the average error is computed.
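The model-based fitting stage amounts to a least-squares fit of a quadratic surface to the sparse depth samples falling inside a color segment, with the residual serving as the fitting error that drives segment selection. A minimal sketch of that fit alone (segment selection across the hierarchy, merging, and growing are not shown; the sample data is synthetic):

```python
# Sketch: fit z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f to sparse depth samples
# inside one color segment and report the residual as the fitting error.
import numpy as np

def fit_quadratic_surface(x, y, z):
    """x, y: pixel coordinates of sparse depth samples; z: their depth values."""
    A = np.stack([x**2, y**2, x*y, x, y, np.ones_like(x)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    rmse = np.sqrt(np.mean((A @ coeffs - z) ** 2))     # fitting error for this segment
    return coeffs, rmse

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    x, y = rng.uniform(0, 50, size=(2, 120))
    z = 0.002 * x**2 - 0.001 * x * y + 0.05 * y + 1.0 + rng.normal(0, 0.01, 120)
    coeffs, rmse = fit_quadratic_surface(x, y, z)
    print(np.round(coeffs, 4), rmse)
```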
{"title":"Segmenting color images into surface patches by exploiting sparse depth data","authors":"B. Dellen, G. Alenyà, S. Foix, C. Torras","doi":"10.1109/WACV.2011.5711558","DOIUrl":"https://doi.org/10.1109/WACV.2011.5711558","url":null,"abstract":"We present a new method for segmenting color images into their composite surfaces by combining color segmentation with model-based fitting utilizing sparse depth data, acquired using time-of-flight (Swissranger, PMD CamCube) and stereo techniques. The main target of our work is the segmentation of plant structures, i.e., leaves, from color-depth images, and the extraction of color and 3D shape information for automating manipulation tasks. Since segmentation is performed in the dense color space, even sparse, incomplete, or noisy depth information can be used. This kind of data often represents a major challenge for methods operating in the 3D data space directly. To achieve our goal, we construct a three-stage segmentation hierarchy by segmenting the color image with different resolutions-assuming that “true” surface boundaries must appear at some point along the segmentation hierarchy. 3D surfaces are then fitted to the color-segment areas using depth data. Those segments which minimize the fitting error are selected and used to construct a new segmentation. Then, an additional region merging and a growing stage are applied to avoid over-segmentation and label previously unclustered points. Experimental results demonstrate that the method is successful in segmenting a variety of domestic objects and plants into quadratic surfaces. At the end of the procedure, the sparse depth data is completed using the extracted surface models, resulting in dense depth maps. For stereo, the resulting disparity maps are compared with ground truth and the average error is computed.","PeriodicalId":424724,"journal":{"name":"2011 IEEE Workshop on Applications of Computer Vision (WACV)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125477865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}