Multi-Dimension Aware Back Projection Network For Scene Text Detection
Yizhan Zhao, Sumei Li, Yongli Chang
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675323
Recently, scene text detection based on deep learning has progressed substantially. Nevertheless, most previous models with FPN are limited by the drawback of sample interpolation algorithms, which fail to generate high-quality up-sampled features. Accordingly, we propose an end-to-end trainable text detector to alleviate this problem. Specifically, a Back Projection Enhanced Up-sampling (BPEU) block is proposed to counteract the drawback of sample interpolation algorithms. It significantly enhances the quality of up-sampled features by employing back projection and detail compensation. Furthermore, a Multi-Dimensional Attention (MDA) block is devised to learn different knowledge from the spatial and channel dimensions, intelligently selecting features to generate more discriminative representations. Experimental results on three benchmarks, ICDAR2015, ICDAR2017-MLT and MSRA-TD500, demonstrate the effectiveness of our method.
{"title":"Multi-Dimension Aware Back Projection Network For Scene Text Detection","authors":"Yizhan Zhao, Sumei Li, Yongli Chang","doi":"10.1109/VCIP53242.2021.9675323","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675323","url":null,"abstract":"Recently, scene text detection based on deep learning has progressed substantially. Nevertheless, most previous models with FPN are limited by the drawback of sample interpolation algorithms, which fail to generate high-quality up-sampled features. Accordingly, we propose an end-to-end trainable text detector to alleviate the above dilemma. Specifically, a Back Projection Enhanced Up-sampling (BPEU) block is proposed to alleviate the drawback of sample interpolation algorithms. It significantly enhances the quality of up-sampled features by employing back projection and detail compensation. Further-more, a Multi-Dimensional Attention (MDA) block is devised to learn different knowledge from spatial and channel dimensions, which intelligently selects features to generate more discriminative representations. Experimental results on three benchmarks, ICDAR2015, ICDAR2017- MLT and MSRA-TD500, demonstrate the effectiveness of our method.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130606417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DIRECT: Discrete Image Rescaling with Enhancement from Case-specific Textures
Yan-An Chen, Ching-Chun Hsiao, Wen-Hsiao Peng, Ching-Chun Huang
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675420
This paper addresses image rescaling, the task of downscaling an input image and later upscaling it for transmission, storage, or playback on heterogeneous devices. The state-of-the-art image rescaling network (known as IRN) tackles image downscaling and upscaling as mutually invertible tasks using invertible affine coupling layers. In particular, for upscaling, IRN models the missing high-frequency component with input-independent (case-agnostic) Gaussian noise. In this work, we take one step further and predict a case-specific high-frequency component from textures embedded in the downscaled image. Moreover, we adopt integer coupling layers to avoid quantizing the downscaled image. When tested on commonly used datasets, the proposed method, termed DIRECT, improves high-resolution reconstruction quality both subjectively and objectively, while maintaining visually pleasing downscaled images.
{"title":"DIRECT: Discrete Image Rescaling with Enhancement from Case-specific Textures","authors":"Yan-An Chen, Ching-Chun Hsiao, Wen-Hsiao Peng, Ching-Chun Huang","doi":"10.1109/VCIP53242.2021.9675420","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675420","url":null,"abstract":"This paper addresses image rescaling, the task of which is to downscale an input image followed by upscaling for the purposes of transmission, storage, or playback on heterogeneous devices. The state-of-the-art image rescaling network (known as IRN) tackles image downscaling and upscaling as mutually invertible tasks using invertible affine coupling layers. In particular, for upscaling, IRN models the missing high-frequency component by an input-independent (case-agnostic) Gaussian noise. In this work, we take one step further to predict a case-specific high-frequency component from textures embedded in the downscaled image. Moreover, we adopt integer coupling layers to avoid quantizing the downscaled image. When tested on commonly used datasets, the proposed method, termed DIRECT, improves high-resolution reconstruction quality both subjectively and objectively, while maintaining visually pleasing downscaled images.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133772093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Entropy-based Deep Product Quantization for Visual Search and Deep Feature Compression
Benben Niu, Ziwei Wei, Yun He
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675383
With the emergence of various deep-learning-based machine-to-machine and machine-to-human tasks, the amount of deep feature data is increasing. Deep product quantization is widely applied in deep feature retrieval tasks and has achieved good accuracy. However, it is not primarily designed for compression, and its output is a fixed-length quantization index, which is not well suited to subsequent compression. In this paper, we propose an entropy-based deep product quantization algorithm for deep feature compression. First, it introduces entropy into the hard and soft quantization strategies, which adapt to the codebook optimization and codeword determination operations in the training and testing processes, respectively. Second, loss functions related to entropy are designed to adjust the distribution of the quantization indices so that it accommodates the subsequent entropy coding module. Experimental results on retrieval tasks show that the proposed method can be combined with deep product quantization and its extended schemes in general, and achieves better compression performance under near-lossless conditions.
{"title":"Entropy-based Deep Product Quantization for Visual Search and Deep Feature Compression","authors":"Benben Niu, Ziwei Wei, Yun He","doi":"10.1109/VCIP53242.2021.9675383","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675383","url":null,"abstract":"With the emergence of various machine-to-machine and machine-to-human tasks with deep learning, the amount of deep feature data is increasing. Deep product quantization is widely applied in deep feature retrieval tasks and has achieved good accuracy. However, it does not focus on the compression target primarily, and its output is a fixed-length quantization index, which is not suitable for subsequent compression. In this paper, we propose an entropy-based deep product quantization algorithm for deep feature compression. Firstly, it introduces entropy into hard and soft quantization strategies, which can adapt to the codebook optimization and codeword determination operations in the training and testing processes respectively. Secondly, the loss functions related to entropy are designed to adjust the distribution of quantization index, so that it can accommodate to the subsequent entropy coding module. Experimental results carried on retrieval tasks show that the proposed method can be generally combined with deep product quantization and its extended schemes, and can achieve a better compression performance under near lossless condition.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131067395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Complex Event Recognition via Spatial-Temporal Relation Graph Reasoning
Hua Lin, Hongtian Zhao, Hua Yang
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675337
Events in videos usually involve a variety of factors: objects, environments, actions, and their interaction relations. These factors, as mid-level semantics, can bridge the gap between event categories and video clips. In this paper, we present a novel video event recognition method that uses graph convolution networks to represent and reason about the logical relations among these inner factors. Considering that different kinds of events may focus on different factors, we use transformer networks to extract spatial-temporal features, drawing on the attention mechanism to adaptively assign weights to the key factors of concern. Although transformers generally rely on large datasets, we show the effectiveness of applying a 2D convolution backbone before the transformers. We train and test our framework on the challenging video event recognition dataset UCF-Crime and conduct ablation studies. The experimental results show that our method achieves state-of-the-art performance, outperforming previous advanced models by a significant margin in recognition accuracy.
{"title":"Complex Event Recognition via Spatial-Temporal Relation Graph Reasoning","authors":"Hua Lin, Hongtian Zhao, Hua Yang","doi":"10.1109/VCIP53242.2021.9675337","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675337","url":null,"abstract":"Events in videos usually contain a variety of factors: objects, environments, actions, and their interaction relations, and these factors as the mid-level semantics can bridge the gap between the event categories and the video clips. In this paper, we present a novel video events recognition method that uses the graph convolution networks to represent and reason the logic relations among the inner factors. Considering that different kinds of events may focus on different factors, we especially use the transformer networks to extract the spatial-temporal features drawing upon the attention mechanism that can adaptively assign weights to concerned key factors. Although transformers generally rely more on large datasets, we show the effectiveness of applying a 2D convolution backbone before the transformers. We train and test our framework on the challenging video event recognition dataset UCF-Crime and conduct ablation studies. The experimental results show that our method achieves state-of-the-art performance, outperforming previous principal advanced models with a significant margin of recognition accuracy.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133499813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-time embedded hologram calculation for augmented reality glasses
Antonin Gilles
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675435
Thanks to its ability to provide accurate focus cues, holography is considered a promising display technology for augmented reality glasses. However, since a hologram contains a large amount of data, its calculation is a time-consuming process that results in prohibitive head-motion-to-photon latency, especially when using embedded calculation hardware. In this paper, we present a real-time hologram calculation method implemented on an NVIDIA Jetson AGX Xavier embedded platform. Our method is based on two modules: an offline pre-computation module and an on-the-fly hologram synthesis module. In the offline pre-computation module, the omnidirectional light field scattered by each scene object is individually pre-computed and stored in a Look-Up Table (LUT). Then, in the hologram synthesis module, the light waves corresponding to the viewer's position and orientation are extracted from the LUT in real time to compute the hologram. Experimental results show that the proposed method is able to compute 2K×1K color holograms at more than 50 frames per second, enabling its use in augmented reality applications.
{"title":"Real-time embedded hologram calculation for augmented reality glasses","authors":"Antonin Gilles","doi":"10.1109/VCIP53242.2021.9675435","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675435","url":null,"abstract":"Thanks to its ability to provide accurate focus cues, Holography is considered as a promising display technology for augmented reality glasses. However, since it contains a large amount of data, the calculation of a hologram is a time-consuming process which results in prohibiting head-motion-to-photon latency, especially when using embedded calculation hardware. In this paper, we present a real-time hologram calculation method implemented on a NVIDIA Jetson AGX Xavier embedded platform. Our method is based on two modules: an offline pre-computation module and an on-the-fly hologram synthesis module. In the offline calculation module, the omnidirectional light field scattered by each scene object is individually pre-computed and stored in a Look-Up Table (LUT). Then, in the hologram synthesis module, the light waves corresponding to the viewer's position and orientation are extracted from the LUT in real-time to compute the hologram. Experimental results show that the proposed method is able to compute 2K1K color holograms at more than 50 frames per second, enabling its use in augmented reality applications.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131927286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Underwater Image Enhancement with Multi-Scale Residual Attention Network
Yosuke Ueki, M. Ikehara
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675342
Underwater images suffer from low contrast, color distortion and visibility degradation due to light scattering and attenuation. Over the past few years, the importance of underwater image enhancement has increased because of ocean engineering and underwater robotics. Existing underwater image enhancement methods are based on various assumptions. However, it is almost impossible to define assumptions that hold across the diversity of underwater images, so these methods are only effective for specific types of images. Recently, underwater image enhancement algorithms using CNNs and GANs have been proposed, but they are not as advanced as other image processing methods due to the lack of suitable training datasets and the complexity of the problem. To address these problems, we propose a novel underwater image enhancement method that combines residual feature attention blocks with a novel combination of multi-scale and multi-patch structures. The multi-patch network extracts local features to adjust to the various, often non-homogeneous underwater images. In addition, our network includes a multi-scale network, which is often effective for image restoration. Experimental results show that our proposed method outperforms conventional methods on various types of images.
{"title":"Underwater Image Enhancement with Multi-Scale Residual Attention Network","authors":"Yosuke Ueki, M. Ikehara","doi":"10.1109/VCIP53242.2021.9675342","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675342","url":null,"abstract":"Underwater images suffer from low contrast, color distortion and visibility degradation due to the light scattering and attenuation. Over the past few years, the importance of underwater image enhancement has increased because of ocean engineering and underwater robotics. Existing underwater image enhancement methods are based on various assumptions. However, it is almost impossible to define appropriate assumptions for underwater images due to the diversity of underwater images. Therefore, they are only effective for specific types of underwater images. Recently, underwater image enhancement algorisms using CNNs and GANS have been proposed, but they are not as advanced as other image processing methods due to the lack of suitable training data sets and the complexity of the issues. To solve the problems, we propose a novel underwater image enhancement method which combines the residual feature attention block and novel combination of multi-scale and multi-patch structure. Multi-patch network extracts local features to adjust to various underwater images which are often Non-homogeneous. In addition, our network includes multi-scale network which is often effective for image restoration. Experimental results show that our proposed method outperforms the conventional method for various types of images.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134552658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced Cross Component Sample Adaptive Offset for AVS3
Yunrui Jian, Jiaqi Zhang, Junru Li, Suhong Wang, Shanshe Wang, Siwei Ma, Wen Gao
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675321
Cross-component prediction has great potential for removing the redundancy between components. Recently, cross-component sample adaptive offset (CCSAO) was adopted in the third generation of the Audio Video coding Standard (AVS3); it utilizes the intensities of co-located luma samples to determine the offsets of the chroma sample filters. However, the frame-level offset is coarse for diverse content, and the edge information of the classified samples is ignored. In this paper, we propose an enhanced CCSAO (ECCSAO) method to further improve coding performance. First, four selectable 1-D directional patterns are added to make the mapping between luma and chroma components more effective. Second, a four-layer quad-tree-based structure is designed to improve the filtering flexibility of CCSAO. Experimental results show that the proposed approach achieves 1.51%, 2.33% and 2.68% BD-rate savings for the All-Intra (AI), Random-Access (RA) and Low-Delay B (LD) configurations, respectively, compared to the AVS3 reference software. A subset of the ECCSAO improvements has been adopted by AVS3.
{"title":"Enhanced Cross Component Sample Adaptive Offset for AVS3","authors":"Yunrui Jian, Jiaqi Zhang, Junru Li, Suhong Wang, Shanshe Wang, Siwei Ma, Wen Gao","doi":"10.1109/VCIP53242.2021.9675321","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675321","url":null,"abstract":"Cross-component prediction has great potential for removing the redundancy of multi-components. Recently, cross-component sample adaptive offset (CCSAO) was adopted in the third generation of Audio Video coding Standard (AVS3), which utilizes the intensities of co-located luma samples to determine the offsets of chroma sample filters. However, the frame-level based offset is rough for various content, and the edge information of classified samples is ignored. In this paper, we propose an enhanced CCSAO (ECCSAO) method to further improve the coding performance. Firstly, four selectable 1-D directional patterns are added to make the mapping between luma and chroma components more effectively. Secondly, one four-layer quad-tree based structure is designed to improve the filtering flexibility of CCSAO. Experimental results show that the proposed approach achieves 1.51%, 2.33% and 2.68% BD-rate savings for All-Intra (AI), Random-Access (RA) and Low Delay B (LD) configurations compared to AVS3 reference software, respectively. A subset improvement of ECCSAO has been adopted by AVS3.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"2 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126326869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Telemoji: A video chat with automated recognition of facial expressions
Alex Kreinis, Tom Damri, Tomer Leon, Marina Litvak, Irina Rabaev
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675330
Autism spectrum disorder (ASD) is frequently accompanied by impairment in emotional expression recognition, and therefore individuals with ASD may find it hard to interpret emotions and interact. Inspired by this fact, we developed a web-based video chat to assist people with ASD, both for real-time recognition of facial emotions and for practice. This real-time application detects the speaker's face in a video stream and classifies the expressed emotion into one of seven categories: neutral, surprise, happy, angry, disgust, fear, and sad. The classification is then displayed as a text label below the speaker's face. We developed this application as part of an undergraduate project for the B.Sc. degree in Software Engineering. Its development and testing were carried out in cooperation with the local society for children and adults with autism. The application has been released for unrestricted use at https://telemojii.herokuapp.com/. The demo is available at http://www.filedropper.com/telemojishortdemoblur.
{"title":"Telemoji: A video chat with automated recognition of facial expressions","authors":"Alex Kreinis, Tom Damri, Tomer Leon, Marina Litvak, Irina Rabaev","doi":"10.1109/VCIP53242.2021.9675330","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675330","url":null,"abstract":"Autism spectrum disorder (ASD) is frequently ac-companied by impairment in emotional expression recognition, and therefore individuals with ASD may find it hard to interpret emotions and interact. Inspired by this fact, we developed a web-based video chat to assist people with ASD, both for real-time recognition of facial emotions and for practicing. This real-time application detects the speaker's face in a video stream and classifies the expressed emotion into one of the seven categories: neutral, surprise, happy, angry, disgust, fear, and sad. The classification is then displayed as the text label below the speaker's face. We developed this application as a part of the undergraduate project for the B.Sc. degree in Software Engineering. Its development and testing were made with the cooperation of the local society for children and adults with autism. The application has been released for unrestricted use on https://telemojii.herokuapp.com/. The demo is available at http://www.filedropper.com/telemojishortdemoblur.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"45 Suppl 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126390728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pixel Gradient Based Zooming Method for Plenoptic Intra Prediction
Fan Jiang, Xin Jin, Kedeng Tong
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675380
Plenoptic 2.0 videos, which record time-varying light fields with focused plenoptic cameras, are promising for immersive visual applications because they capture densely sampled light fields with high spatial resolution in the rendered sub-apertures. In this paper, an intra prediction method is proposed for compressing multi-focus plenoptic 2.0 videos efficiently. Based on the estimation of the zooming factor, novel gradient-feature-based zooming, adaptive-bilinear-interpolation-based tailoring and inverse-gradient-based boundary filtering are proposed and executed sequentially to generate accurate prediction candidates for weighted prediction, working with an adaptive skipping strategy. Experimental results demonstrate the superior performance of the proposed method relative to HEVC and state-of-the-art methods.
{"title":"Pixel Gradient Based Zooming Method for Plenoptic Intra Prediction","authors":"Fan Jiang, Xin Jin, Kedeng Tong","doi":"10.1109/VCIP53242.2021.9675380","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675380","url":null,"abstract":"Plenoptic 2.0 videos that record time-varying light fields by focused plenoptic cameras are prospective to immersive visual applications due to capturing dense sampled light fields with high spatial resolution in the rendered sub-apertures. In this paper, an intra prediction method is proposed for compressing multi-focus plenoptic 2.0 videos efficiently. Based on the estimation of zooming factor, novel gradient-feature-based zooming, adaptive-bilinear-interpolation-based tailoring and inverse-gradient-based boundary filtering are proposed and executed sequentially to generate accurate prediction candidates for weighted prediction working with adaptive skipping strategy. Experimental results demonstrate the superior performance of the proposed method relative to HEVC and state-of-the-art methods.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"2022 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127601765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reinforcement Learning based ROI Bit Allocation for Gaming Video Coding in VVC
Guangjie Ren, Zizheng Liu, Zhenzhong Chen, Shan Liu
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675345
In this paper, we propose a reinforcement learning based region-of-interest (ROI) bit allocation method for gaming video coding in Versatile Video Coding (VVC). Most current ROI-based bit allocation methods rely on bit budgets derived from frame-level empirical weight allocation. These restricted bit budgets limit the efficiency of ROI-based bit allocation and the stability of video quality. To address this issue, the frame-level and ROI-level bit allocation processes are combined and formulated as a Markov decision process (MDP). A deep reinforcement learning (RL) method is adopted to solve this problem and obtain appropriate bit allocations for the frame and the ROI. Our target is to improve the quality of the ROI and reduce frame-level quality fluctuation while satisfying the bit budget constraint. The RL-based ROI bit allocation method is implemented in the latest video coding standard and verified for gaming video coding. The experimental results demonstrate that the proposed method achieves better ROI quality while reducing quality fluctuation compared to the reference methods.
{"title":"Reinforcement Learning based ROI Bit Allocation for Gaming Video Coding in VVC","authors":"Guangjie Ren, Zizheng Liu, Zhenzhong Chen, Shan Liu","doi":"10.1109/VCIP53242.2021.9675345","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675345","url":null,"abstract":"In this paper, we propose a reinforcement learning based region of interest (ROI) bit allocation method for gaming video coding in Versatile Video Coding (VVC). Most current ROI-based bit allocation methods rely on bit budgets based on frame-level empirical weight allocation. The restricted bit budgets influence the efficiency of ROI-based bit allocation and the stability of video quality. To address this issue, the bit allocation process of frame and ROI are combined and formulated as a Markov decision process (MDP). A deep reinforcement learning (RL) method is adopted to solve this problem and obtain the appropriate bits of frame and ROI. Our target is to improve the quality of ROI and reduce the frame-level quality fluctuation, whilst satisfying the bit budgets constraint. The RL-based ROI bit allocation method is implemented in the latest video coding standard and verified for gaming video coding. The experimental results demonstrate that the proposed method achieves a better quality of ROI while reducing the quality fluctuation compared to the reference methods.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"15 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121005484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}