"Efficient Control of PTZ Cameras in Automated Video Surveillance Systems" by Musab S. Al-Hadrusi and Nabil J. Sarhan (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.72)
This paper deals with the camera control problem in automated video surveillance. We develop a solution that seeks to optimize the overall subject recognition probability by controlling the pan, tilt, and zoom of the deployed Pan/Tilt/Zoom (PTZ) cameras. Since the number of subjects is usually much larger than the number of cameras, the problem to be addressed is how to assign subjects to these cameras. The camera control is based on each subject's location and direction of movement, its distances from the cameras, occlusion, the overall recognition probability achieved so far, and the expected time until the subject leaves the site, as well as the cameras' movements, capabilities, and limitations. The developed solution works with realistic 3D environments, not just 2D scenes. We analyze the effectiveness of the proposed solution through extensive simulation.
"Interframe Coding of Canonical Patches for Mobile Augmented Reality" by Mina Makar, Sam S. Tsai, V. Chandrasekhar, David M. Chen, and B. Girod (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.18)
Local features are widely used for content-based image retrieval and augmented reality applications. Typically, feature descriptors are calculated from the gradients of a canonical patch around a repeatable key point in the image. In previous work, we showed that one can alternatively transmit the compressed canonical patch and perform descriptor computation at the receiving end with comparable performance. In this paper, we propose a temporally coherent key point detector in order to allow efficient interframe coding of canonical patches. In inter-patch compression, one strives to transmit each patch with as few bits as possible by simply modifying a previously transmitted patch. This enables server-based mobile augmented reality, where a continuous stream of salient information, sufficient for image-based retrieval, can be sent over a wireless link at the smallest possible bit-rate. Experimental results show that our technique achieves similar image matching performance at 1/10 of the bit-rate compared to detecting key points independently frame by frame.
"Multimodal Information Fusion of Audio Emotion Recognition Based on Kernel Entropy Component Analysis" by Zhibing Xie and L. Guan (2012 IEEE International Symposium on Multimedia, DOI: 10.1142/S1793351X13400023)
This paper focuses on the application of novel information-theoretic tools in the area of information fusion. Feature transformation and fusion are critical to the performance of information fusion; however, the majority of existing works depend on second-order statistics, which are optimal only for Gaussian-like distributions. In this paper, the integration of information fusion techniques with kernel entropy component analysis (KECA) provides a new information-theoretic tool. The fusion of features is realized using a descriptor of information entropy and is optimized by entropy estimation. A novel multimodal information fusion strategy for audio emotion recognition based on KECA is presented. The effectiveness of the proposed solution is evaluated through experiments on two audiovisual emotion databases. Experimental results show that the proposed solution outperforms existing methods, especially when the dimension of the feature space is substantially reduced. The proposed method offers a general theoretical analysis and provides an approach for applying information theory to multimedia research.
"Logo Classification with Edge-Based DAISY Descriptor" by B. Lei, V. Thing, Yu Chen, and Wee-Yong Lim (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.50)
Classifying merchandise logos is challenging: only a few key points can be found in the relatively small logo images due to large variations in texture, poor illumination, and a general lack of discriminative features. This paper addresses these difficulties by introducing an integrated approach to classifying merchandise logos that combines the local edge-based DAISY descriptor, spatial histograms, and salient region detection. During the training phase, after edge extraction, merchandise logos are described with a set of SIFT-like DAISY descriptors, which are computed efficiently and densely along edge pixels. A visual word vocabulary and spatial histograms are used to describe the images/regions. A saliency map for object detection is adopted to narrow down and localize the logos. A feature map approximating a non-linear kernel is also used to facilitate classification with a linear SVM classifier. The experimental results demonstrate that the Edge-based DAISY (EDAISY) descriptor outperforms the state-of-the-art SIFT and DSIFT descriptors in terms of classification accuracy on a collected logo image dataset.
"FaceFetch: A User Emotion Driven Multimedia Content Recommendation System Based on Facial Expression Recognition" by Mahesh Babu Mariappan, Myunghoon Suk, and B. Prabhakaran (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.24)
Recognition of users' facial expressions allows researchers to build context-aware applications that adapt according to the users' emotional states. Facial expression recognition is an active area of research in the computer vision community. In this paper, we present FaceFetch, a novel context-based multimedia content recommendation system that understands a user's current emotional state (happiness, sadness, fear, disgust, surprise, and anger) through facial expression recognition and recommends multimedia content to the user. Our system can understand a user's emotional state through a desktop as well as a mobile user interface and pull multimedia content such as music, movies, and other videos of interest to the user from the cloud with near-real-time performance.
"Efficient Filtering of JPEG Images" by David Edmundson and G. Schaefer (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.88)
With image databases growing rapidly, efficient methods for content-based image retrieval (CBIR) are highly sought after. In this paper, we present a very fast method for filtering JPEG compressed images to discard irrelevant pictures. We show that compressing images using individually optimised quantisation tables not only maintains high image quality and therefore allows for improved compression rates, but that the quantisation tables themselves provide a useful image descriptor for CBIR. Visual similarity between images can thus be expressed as similarity between their quantisation tables. As these are stored in the JPEG header, feature extraction and similarity computation can be performed extremely fast, and we consequently employ our method as an initial filtering step for a subsequent CBIR algorithm. We show, on a benchmark dataset of more than 30,000 images, that we can filter out 80% or more of the images without a drop in retrieval performance while reducing the online retrieval time by a factor of about 5.
"GPU-Enabled High Performance Online Visual Search with High Accuracy" by Ali Cevahir and Junji Torii (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.85)
We propose an online image search engine based on local image features (key points) that runs fully on GPUs. State-of-the-art visual image retrieval techniques are based on the bag-of-visual-words (BoV) model, which is an analogy to text-based search. In BoV, each key point is rounded off to the nearest visual word. In this work, by contrast, thanks to the vector computation power of GPUs, we utilize the real-valued key point descriptors. We match key points in two steps. The first step is similar to visual word matching in BoV. In the second step, we match at the key point level. By keeping the identity of each key point, the closest key points are accurately retrieved in real time. Image search has different characteristics from textual search, and we implement one-to-one key point matching, which is more natural for images. Our experiments reveal a 265x speedup for offline index generation, a 104x speedup for online index search, and a 20.5x speedup for online key point matching, compared to the CPU implementation. Our proposed key-point-matching-based search improves the accuracy of BoV by 9.5%.
"A Cloud-Based Collaborative and Automatic Video Editor" by A. Outtagarts and Abderrazagh Mbodj (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.78)
Automatic video editing is a hot topic due to the rapid growth of video usage. In this paper, we present a cloud-based tool and an approach to automatic video editing based on keywords extracted from the audio transcription. Using the text transcript of the audio, video sequences are selected and chained to automatically create a new video whose duration is fixed by the user. A cloud-based video editor allows users to collaboratively edit the video.
"Towards Automatic Stereoscopic Video Synthesis from a Casual Monocular Video" by Lin Zhong, Sen Wang, Minwoo Park, Rodney L. Miller, and Dimitris N. Metaxas (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.64)
Automatically synthesizing 3D content from a casual monocular video has become an important problem. Previous works either use no geometry information or rely on precise 3D geometry information. Therefore, they cannot obtain reasonable results when the 3D structure of the scene is complex or when only noisy 3D geometry information can be estimated from the monocular video. In this paper, we present an automatic and robust framework to synthesize stereoscopic videos from casual 2D monocular videos. First, 3D geometry information (e.g., camera parameters and a depth map) is extracted from the 2D input video. Then a Bayesian-based View Synthesis (BVS) approach is proposed to render high-quality virtual views for stereoscopic video while coping with noisy 3D geometry information. Extensive experiments on various videos demonstrate that BVS synthesizes more accurate views than other methods and that our proposed framework is able to generate high-quality 3D videos.
"Enhancing the MST-CSS Representation Using Robust Geometric Features, for Efficient Content Based Video Retrieval (CBVR)" by C. Chattopadhyay and Sukhendu Das (2012 IEEE International Symposium on Multimedia, DOI: 10.1109/ISM.2012.71)
Multi-Spectro-Temporal Curvature Scale Space (MST-CSS) was proposed as a video content descriptor in earlier work, with peak and saddle points used as feature points. However, these are inadequate for capturing the salient features of the MST-CSS surface, producing poor retrieval results. To overcome this, we propose EMST-CSS (Enhanced MST-CSS) as a better feature representation with an improved matching method for content-based video retrieval (CBVR). A comparative study with the existing MST-CSS representation and two state-of-the-art methods for CBVR shows enhanced performance on one synthetic and two real-world datasets.