Although it has been studied extensively for many years, automatic image annotation remains a challenging problem. Recently, data-driven approaches have demonstrated great success in image auto-annotation. Such approaches leverage abundant partially annotated web images to annotate an uncaptioned image: they first retrieve a group of visually similar images using the uncaptioned image as a query, then mine meaningful phrases from the surrounding texts of the search results. Since the surrounding texts are generally noisy, effectively mining meaningful phrases is crucial to the success of such approaches. We propose a mixture modeling approach that assumes a tag is generated from a convex combination of topics. Unlike a typical topic modeling approach such as LDA, the topics in our approach are explicitly learned from a definitive catalog of the Web, the Open Directory Project (ODP). Compared with previous work, our approach has two advantages: first, it uses an open vocabulary rather than a limited one defined by a training set; second, it is efficient enough for real-time annotation. Experiments conducted on two billion web images show the efficiency and effectiveness of the proposed approach.
{"title":"Efficient Tag Mining via Mixture Modeling for Real-Time Search-Based Image Annotation","authors":"Lican Dai, Xin-Jing Wang, Lei Zhang, Nenghai Yu","doi":"10.1109/ICME.2012.104","DOIUrl":"https://doi.org/10.1109/ICME.2012.104","url":null,"abstract":"Although it has been extensively studied for many years, automatic image annotation is still a challenging problem. Recently, data-driven approaches have demonstrated their great success to image auto-annotation. Such approaches leverage abundant partially annotated web images to annotate an uncaptioned image. Specifically, they first retrieve a group of visually closely similar images given an uncaptioned image as a query, then figure out meaningful phrases from the surrounding texts of the image search results. Since the surrounding texts are generally noisy, how to effectively mine meaningful phrases is crucial for the success of such approaches. We propose a mixture modeling approach which assumes that a tag is generated from a convex combination of topics. Different from a typical topic modeling approach like LDA, topics in our approach are explicitly learnt from a definitive catalog of the Web, i.e. the Open Directory Project (ODP). Compared with previous works, it has two advantages: Firstly, it uses an open vocabulary rather than a limited one defined by a training set. Secondly, it is efficient for real-time annotation. Experimental results conducted on two billion web images show the efficiency and effectiveness of the proposed approach.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127638159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, several Gaussian-like image representations have been proposed as alternatives to the bag-of-words representation over local features, aiming to overcome the quantization error inherent in bag-of-words. They have proven effective in different applications: Extended Hierarchical Gaussianization achieved excellent performance with a single feature in VOC2009, and the Vector of Locally Aggregated Descriptors and the Fisher Kernel achieved excellent performance on the Holiday dataset using only signature-like representations. Despite their success and similarity, no comparative study of these representations has been made. In this paper, we perform a systematic comparison of three emerging Gaussian-like representations: Extended Hierarchical Gaussianization, the Fisher Kernel, and the Vector of Locally Aggregated Descriptors. We evaluate their performance and the influence of features and parameters on the Holiday and CC_Web_Video datasets, and we report several important properties of these representations observed during our investigation. This study provides a better understanding of Gaussian-like image representations, which are believed to be promising in various applications.
{"title":"Evaluating Gaussian Like Image Representations over Local Features","authors":"Yu-Chuan Su, Guan-Long Wu, Tzu-Hsuan Chiu, Winston H. Hsu, Kuo-Wei Chang","doi":"10.1109/ICME.2012.23","DOIUrl":"https://doi.org/10.1109/ICME.2012.23","url":null,"abstract":"Recently, several Gaussian like image representations are proposed as an alternative of the bag-of-word representation over local features. These representations are proposed to overcome the quantization error problem faced in bag-of-word representation. They are shown to be effective in different applications, the Extended Hierarchical Gaussianization reached excellent performance using single feature in VOC2009, Vector of Locally Aggregated Descriptors and Fisher Kernel reached excellent performance using only signature like representation on Holiday dataset. Despite their success and similarity, no comparative study about these representations has been made. In this paper, we perform a systematic comparison about three emerging different gaussian like representations: Extended Hierarchical Gaussianization, Fisher Kernel and Vector of Locally Aggregated Descriptors. We evaluate the performance and the influence of feature and parameters of these representations on Holiday and CC_Web_Video datasets, and several important properties about these representations have been observed during our investigation. This study provides better understanding about these gaussian like image representations that are believed to be promising in various applications.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126264007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the proliferation of cameras in public areas, it becomes increasingly desirable to develop fully automated surveillance and monitoring systems. In this paper, we propose a novel unsupervised approach that automatically discovers the motion patterns occurring in dynamic scenes under an improved sparse topical coding (STC) framework. Given an input video from a fixed camera, we first segment the video into a sequence of non-overlapping clips (documents). Optical flow features are extracted from each pair of consecutive frames and quantized into discrete visual words. The video is then represented by a word-document hierarchical topic model through a generative process. Finally, an improved sparse topical coding approach is proposed for model learning. The semantic motion patterns (latent topics) are learned automatically, and each video clip is represented as a weighted sum of these patterns with only a few nonzero coefficients. The proposed approach is purely data-driven and scene-independent (not object-class specific), which makes it suitable for a very wide range of scenarios. Experiments demonstrate that our approach outperforms state-of-the-art techniques in dynamic scene analysis.
{"title":"Learning Semantic Motion Patterns for Dynamic Scenes by Improved Sparse Topical Coding","authors":"Wei Fu, Jinqiao Wang, Zechao Li, Hanqing Lu, Songde Ma","doi":"10.1109/ICME.2012.133","DOIUrl":"https://doi.org/10.1109/ICME.2012.133","url":null,"abstract":"With the proliferation of cameras in public areas, it becomes increasingly desirable to develop fully automated surveillance and monitoring systems. In this paper, we propose a novel unsupervised approach to automatically explore motion patterns occurring in dynamic scenes under an improved sparse topical coding (STC) framework. Given an input video with a fixed camera, we first segment the whole video into a sequence of clips (documents) without overlapping. Optical flow features are extracted from each pair of consecutive frames, and quantized into discrete visual words. Then the video is represented by a word-document hierarchical topic model through a generative process. Finally, an improved sparse topical coding approach is proposed for model learning. The semantic motion patterns (latent topics) are learned automatically and each video clip is represented as a weighted summation of these patterns with only a few nonzero coefficients. The proposed approach is purely data-driven and scene independent (not an object-class specific), which make it suitable for very large range of scenarios. Experiments demonstrate that our approach outperforms the state-of-the art technologies in dynamic scene analysis.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130137220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Popularity prediction is a key problem in analyzing information diffusion over networks, especially in social media communities. Recently, custom-built prediction models have been developed for Digg and YouTube. However, these models are hard to transplant to an incomplete social network site (e.g., Flickr) because of their site-specific parameters. In addition, because of the large scale of the Flickr network, it is difficult to obtain all of the photos and the whole network, so we seek a method that works on such an incomplete network. Inspired by a collaborative filtering method, Network-Based Inference (NBI), we devise a weighted bipartite graph with undetected users and items to represent the resource-allocation process in an incomplete network. Instead of relying on image analysis, we propose a modified interdisciplinary model, called Incomplete Network-Based Inference (INI). Using 30 months of Flickr data, we show that the proposed INI increases prediction accuracy by over 58.1% compared with traditional NBI. We apply INI to a personalized advertising application and show that it is more attractive than traditional Flickr advertising.
{"title":"Predicting Image Popularity in an Incomplete Social Media Community by a Weighted Bi-partite Graph","authors":"Xiang Niu, Lusong Li, Tao Mei, Jialie Shen, Ke Xu","doi":"10.1109/ICME.2012.43","DOIUrl":"https://doi.org/10.1109/ICME.2012.43","url":null,"abstract":"Popularity prediction is a key problem in networks to analyze the information diffusion, especially in social media communities. Recently, there have been some custom-build prediction models in Digg and YouTube. However, these models are hardly transplant to an incomplete social network site (e.g., Flickr) by their unique parameters. In addition, because of the large scale of the network in Flickr, it is difficult to get all of the photos and the whole network. Thus, we are seeking for a method which can be used in such incomplete network. Inspired by a collaborative filtering method-Network-based Inference (NBI), we devise a weighted bipartite graph with undetected users and items to represent the resource allocation process in an incomplete network. Instead of image analysis, we propose a modified interdisciplinary models, called Incomplete Network-based Inference (INI). Using the data from 30 months in Flickr, we show the proposed INI is able to increase prediction accuracy by over 58.1%, compared with traditional NBI. We apply our proposed INI approach to personalized advertising application and show that it is more attractive than traditional Flickr advertising.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134052487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces a method for the efficient comparison and retrieval of near-duplicates of a query video from a video database. The method generates video signatures from histograms of the orientations of optical flow at feature points, computed from uniformly sampled video frames and concatenated over time to produce time series, which are then aligned and matched. Major incline matching, a data-reduction and peak-alignment method for time series, is adapted for faster performance. The resulting signature is compact and robust against a number of common transformations, including flipping, cropping, picture-in-picture, photometric changes, and the addition of noise and other artifacts. We evaluate on the MUSCLE VCD 2007 dataset and a dataset derived from TRECVID 2009, and we show good precision (88.8% on average) at significantly higher speeds than results reported in the literature (on average, 45 seconds for signature generation plus 92 seconds for a linear search with an 81-second query video over a 300-hour dataset).
{"title":"Fast Near-Duplicate Video Retrieval via Motion Time Series Matching","authors":"John R. Zhang, J. Ren, Fangzhe Chang, Thomas L. Wood, J. Kender","doi":"10.1109/ICME.2012.111","DOIUrl":"https://doi.org/10.1109/ICME.2012.111","url":null,"abstract":"This paper introduces a method for the efficient comparison and retrieval of near duplicates of a query video from a video database. The method generates video signatures from histograms of orientations of optical flow of feature points computed from uniformly sampled video frames concatenated over time to produce time series, which are then aligned and matched. Major incline matching, a data reduction and peak alignment method for time series, is adapted for faster performance. The resultant method is compact and robust against a number of common transformations including: flipping, cropping, picture-in-picture, photometric, addition of noise and other artifacts. We evaluate on the MUSCLE VCD 2007 dataset and a dataset derived from TRECVID 2009. Good precision (average 88.8%) at significantly higher speeds (average durations: 45 seconds for signature generation plus 92 seconds for a linear search of 81-second query video in a 300 hour dataset) than results reported in the literature are shown.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133555628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an edge-directed, non-iterative image interpolation algorithm. In the proposed algorithm, gradient directions are explicitly estimated with a statistical approach: the local dominant gradient direction is obtained by applying principal component analysis (PCA) to the four nearest gradients. The angles of the gradient plane are divided into four sectors, and each gradient direction falls into one of them. We then interpolate with one-dimensional (1-D) cubic convolution perpendicular to the gradient direction. Simulation results show that, compared with state-of-the-art interpolation methods, the proposed PCA-based edge-directed interpolation preserves edges well while maintaining a high PSNR.
{"title":"Principal Components Analysis-Based Edge-Directed Image Interpolation","authors":"Bing Yang, Zhiyong Gao, Xiaoyun Zhang","doi":"10.1109/ICME.2012.153","DOIUrl":"https://doi.org/10.1109/ICME.2012.153","url":null,"abstract":"This paper presents an edge-directed, noniterative image interpolation algorithm. In the proposed algorithm, the gradient directions are explicitly estimated with a statistical-based approach. The local dominant gradient directions are obtained by using principal components analysis (PCA) on the four nearest gradients. The angles of the whole gradient plane are divided into four parts, and each gradient direction falls into one part. Then we implement the interpolation with one-dimention (1-D) cubic convolution interpolation perpendicular to the gradient direction. Compared to the state of-the-art interpolation methods, simulation results show that the proposed PCA-based edge-directed interpolation method preserves edges well while maintaining a high PSNR value.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115549765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There has been much interest in offering multimedia location-based services (LBS) to indoor users (e.g., sending video/audio streams according to user location). Offering good LBS depends largely on accurate indoor localization of mobile stations (MSs). To achieve this, we first model and analyze the error characteristics of important indoor localization schemes based on Radio Frequency Identification (RFID) and Wi-Fi. Our models are simple to use; they capture important system parameters and measurement noise and quantify how these affect localization accuracy. Given that many indoor localization techniques are already deployed, an MS may simultaneously receive multiple co-existing estimates of its location. Equipped with this understanding of location errors, we then investigate how to optimally combine, or fuse, all co-existing estimates of an MS's location, and we present computationally efficient closed-form expressions for fusing the estimators' outputs. Simulation and experimental results show that our fusion technique achieves higher location accuracy despite the errors of the individual estimators.
{"title":"Error Modeling and Estimation Fusion for Indoor Localization","authors":"Weipeng Zhuo, Bo Zhang, S. Chan, E. Chang","doi":"10.1109/ICME.2012.106","DOIUrl":"https://doi.org/10.1109/ICME.2012.106","url":null,"abstract":"There has been much interest in offering multimedia location-based service (LBS) to indoor users (e.g., sending video/audio streams according to user locations). Offering good LBS largely depends on accurate indoor localization of mobile stations (MSs). To achieve that, in this paper we first model and analyze the error characteristics of important indoor localization schemes, using Radio Frequency Identification (RFID) and Wi-Fi. Our models are simple to use, capturing important system parameters and measurement noises, and quantifying how they affect the accuracies of the localization. Given that there have been many indoor localization techniques deployed, an MS may receive simultaneously multiple co-existing estimations on its location. Equipped with the understanding of location errors, we then investigate how to optimally combine, or fuse, all the co-existing estimations of an MS's location. We present computationally-efficient closed-form expressions to fuse the outputs of the estimators. Simulation and experimental results show that our fusion technique achieves higher location accuracy in spite of location errors in the estimators.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114446645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-view video consists of multiple video sequences captured simultaneously from different angles by closely spaced cameras. It enables users to freely change their viewpoint by switching among the video sequences. Transmitting multi-view video requires more bandwidth than conventional multimedia. To reduce the bandwidth, UDMVT (User-Dependent Multi-view Video Transmission), based on MVC (Multi-view Video Coding), has been proposed for a single user. With multiple users, however, UDMVT encodes the same frames into different versions for each user, which increases redundant transmission. To address this problem, this paper proposes UMSM (User-dependent Multi-view video Streaming for Multi-users). UMSM has two characteristics. First, overlapping frames required by multiple users are transmitted only once by multicast, avoiding unnecessary duplication. Second, the time lag between video requests from multiple users is adjusted so that they coincide with the next request. Simulation results using benchmark test sequences provided by MERL show that UMSM decreases the transmission bit-rate by 55.3% on average for 5 users watching the same multi-view video, compared with UDMVT.
{"title":"Traffic Reduction for Multiple Users in Multi-view Video Streaming","authors":"T. Fujihashi, Ziyuan Pan, Takashi Watanabe","doi":"10.1109/ICME.2012.185","DOIUrl":"https://doi.org/10.1109/ICME.2012.185","url":null,"abstract":"Multi-view video consists of multiple video sequences captured simultaneously from different angles by multiple closely spaced cameras. It enables the users to freely change their viewpoints by playing different video sequences. Transmission of multi-view video requires more bandwidth than conventional multimedia. To reduce the bandwidth, UDMVT (User Dependent Multi-view Video Transmission) based on MVC (Multi-view Video Coding) has been proposed for single user. In UDMVT, for multiple users the same frames are encoded into different versions for each user, which increases the redundant transmission. For this problem, this paper proposes UMSM (User dependent Multi-view video Streaming for Multi-users). UMSM possesses two characteristics. The first characteristic is that the overlapped frames that are required by multiple users are transmitted only once using the multicast to avoid unnecessary duplication of transmission. The second characteristic is that a time lag of the video request by multiple users is adjusted to coincide with the next request. Simulation results using benchmark test sequences provided by MERL show that UMSM decreases the transmission bit-rate 55.3% on average for 5 users watching the same multi-view video as compared with UDMVT.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114505067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose an event-driven black-box surveillance camera that reduces energy consumption by waking the system up only when an event is detected and by dynamically adjusting the video encoding, and hence the resulting image distortion, according to the criticality of the captured frames, called the significance level. To achieve this, we find an encoding bit-rate that minimizes the camera's energy consumption while satisfying the limited memory-space constraint and the distortion requirement at each significance level, by judiciously allocating bit-rate across significance levels. In doing so, we consider the trade-off between total energy consumption and encoding bit-rate at each significance level. For further energy savings, we also propose a low-complexity solution that adjusts the energy-minimal encoding bit-rate based on dynamically changing event behavior, i.e., the timing and duration of events. Experimental results show that the proposed method yields up to 67.49% (49.19% on average) energy savings compared with conventional bit-rate allocation methods.
{"title":"Energy-Aware Operation of Black Box Surveillance Cameras under Event Uncertainty and Memory Constraint","authors":"Giwon Kim, Jungsoo Kim, Jongpil Jung, C. Kyung","doi":"10.1109/ICME.2012.21","DOIUrl":"https://doi.org/10.1109/ICME.2012.21","url":null,"abstract":"In this paper, we propose an event-driven black box surveillance camera which reduces energy consumption by waking up the system only when an event is detected and dynamically adjusting the video encoding and the resultant image distortion according to the criticality of captured frames called significance level. To achieve this goal, we find an encoding bitrate minimizing the energy consumption of the camera while satisfying the limited memory space constraint and distortion requirement at each significance level by judiciously allocating bit-rate to each significance level. To do that, we considered the trade-off relations between the total energy consumption vs. encoding bit-rate according to the significance level. For further energy savings, we also proposed a low complexity solution which adjusts the energy-minimal encoding bit-rate based on the dynamically changing event behavior, i.e., timing and duration of events. Experimental results show that the proposed method yields up to 67.49% (49.19% on average) energy savings compared to the conventional bitrate allocation methods.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114708896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel scalable video coding (SVC) scheme is proposed for video transmission over lossy networks. It builds on an estimation-theoretic (ET) framework for optimal prediction and error concealment given all available information from both the current base layer and prior enhancement-layer frames. It incorporates a recursive end-to-end distortion estimation technique, the spectral coefficient-wise optimal recursive estimate (SCORE), which accounts for all ET operations and tracks the first and second moments of the decoder-reconstructed transform coefficients. The overall framework enables optimization of ET-SVC systems for transmission over lossy networks while accounting for all relevant conditions, including the effects of quantization, channel loss, concealment, and error propagation. It thus resolves longstanding difficulties in combining truly optimal prediction and concealment with optimal end-to-end distortion estimation in error-resilient SVC coding decisions. Experiments demonstrate that the proposed scheme offers substantial performance gains over existing error-resilient SVC systems across a wide range of packet-loss rates and bit rates.
{"title":"A Unified Estimation-Theoretic Framework for Error-Resilient Scalable Video Coding","authors":"Jingning Han, Vinay Melkote, K. Rose","doi":"10.1109/ICME.2012.76","DOIUrl":"https://doi.org/10.1109/ICME.2012.76","url":null,"abstract":"A novel scalable video coding (SVC) scheme is proposed for video transmission over loss networks, which builds on an estimation-theoretic (ET) framework for optimal prediction and error concealment, given all available information from both the current base layer and prior enhancement layer frames. It incorporates a recursive end-to-end distortion estimation technique, namely, the spectral coefficient-wise optimal recursive estimate (SCORE), which accounts for all ET operations and tracks the first and second moments of decoder reconstructed transform coefficients. The overall framework enables optimization of ET-SVC systems for transmission over lossy networks, while accounting for all relevant conditions including the effects of quantization, channel loss, concealment, and error propagation. It thus resolves longstanding difficulties in combining truly optimal prediction and concealment with optimal end-to-end distortion and error-resilient SVC coding decisions. Experiments demonstrate that the proposed scheme offers substantial performance gains over existing error-resilient SVC systems, under a wide range of packet loss and bit rates.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116981544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}