Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675437
Plug-and-Play Deblurring for Robust Object Detection
Gerald Xie, Zhu Li, S. Bhattacharyya, A. Mehmood
Object detection is a classic computer vision task that learns a mapping from an image to object bounding boxes and class labels. Many applications of object detection involve images that are prone to degradation at capture time, notably motion blur caused by a moving camera (e.g., on a UAV) or by the moving object itself. One approach to handling this blur is to apply common deblurring methods to recover clean images and then run the vision task; however, deblurring is typically ill-posed. In addition, these methods add to the inference time of the vision network, which can hinder performance on video inputs. To address these issues, we propose a novel plug-and-play (PnP) solution that inserts deblurring features into the target vision task network without retraining the task network. The deblur features are learned with a classification loss over blur strengths and directions, and the PnP scheme works well with the object detection network with minimal added inference time compared with state-of-the-art deblur-then-detect solutions.
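A minimal PyTorch sketch of the plug-and-play idea described in the abstract: a small auxiliary network is trained with a classification loss on blur strength/direction, and its intermediate features are fused into a frozen detection backbone without retraining it. All module names, layer sizes, and the fusion point are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BlurFeatureNet(nn.Module):
    """Learns blur-aware features via a classification loss on blur strength/direction."""
    def __init__(self, n_strengths=4, n_directions=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_strengths * n_directions),  # joint blur-class logits
        )

    def forward(self, x):
        f = self.features(x)
        return f, self.head(f)

class PnPDetector(nn.Module):
    """Injects the learned blur features into a frozen detection backbone."""
    def __init__(self, frozen_backbone, blur_net, backbone_channels=64):
        super().__init__()
        self.backbone = frozen_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # the task network is not retrained
        self.blur_net = blur_net
        self.fuse = nn.Conv2d(backbone_channels + 64, backbone_channels, 1)

    def forward(self, x):
        blur_feat, _ = self.blur_net(x)
        task_feat = self.backbone(x)
        blur_feat = nn.functional.interpolate(blur_feat, size=task_feat.shape[-2:])
        return self.fuse(torch.cat([task_feat, blur_feat], dim=1))

# toy usage with a stand-in backbone
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
model = PnPDetector(backbone, BlurFeatureNet())
out = model(torch.randn(1, 3, 128, 128))
```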
{"title":"Plug-and-Play Deblurring for Robust Object Detection","authors":"Gerald Xie, Zhu Li, S. Bhattacharyya, A. Mehmood","doi":"10.1109/VCIP53242.2021.9675437","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675437","url":null,"abstract":"Object detection is a classic computer vision task, which learns the mapping between an image and object bounding boxes + class labels. Many applications of object detection involve images which are prone to degradation at capture time, notably motion blur from a moving camera like UAVs or object itself. One approach to handling this blur involves using common deblurring methods to recover the clean pixel images and then the apply vision task. This task is typically ill-posed. On top of this, application of these methods also add onto the inference time of the vision network, which can hinder performance of video inputs. To address the issues, we propose a novel plug-and-play (PnP) solution that insert deblurring features into the target vision task network without the need to retrain the task network. The deblur features are learned from a classification loss network on blur strength and directions, and the PnP scheme works well with the object detection network with minimum inference time complexity, compared with the state of the art deblur and then detection solution.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132015696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675360
Learn to overfit better: finding the important parameters for learned image compression
Honglei Zhang, Francesco Cricri, H. R. Tavakoli, M. Santamaría, Y. Lam, M. Hannuksela
For most machine learning systems, overfitting is an undesired behavior. However, overfitting a model to a test image or video at inference time is a favorable and effective technique to improve the coding efficiency of learning-based image and video codecs. At the encoding stage, one or more neural networks that are part of the codec are finetuned on the input image or video to achieve better coding performance. The encoder encodes the input content into a content bitstream. If the finetuned neural network is also part of the decoder, the encoder signals the weight update of the finetuned model to the decoder along with the content bitstream. At the decoding stage, the decoder first updates its neural network model according to the received weight update and then proceeds with decoding the content bitstream. Since a neural network contains a large number of parameters, compressing the weight update is critical to reducing the bitrate overhead. In this paper, we propose learning-based methods to find the parameters that are most important to overfit in terms of rate-distortion performance. Based on simple distribution models for the variables in the weight update, we derive two objective functions. By optimizing the proposed objective functions, importance scores for the parameters can be calculated and the important parameters determined. Our experiments on a lossless image compression codec show that the proposed method significantly outperforms a prior-art method in which the overfitted parameters were selected heuristically. Furthermore, our technique improved the compression performance of the state-of-the-art lossless image compression codec by 0.1 bit per pixel.
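A hedged sketch of the overall pipeline described above: finetune a copy of a decoder-side network on the input content, then keep only the most important entries of the weight update before signalling it. The paper derives importance scores from two rate-distortion-motivated objective functions; the magnitude-based score and the `finetune_step` helper used here are placeholder assumptions.

```python
import copy
import torch

def overfit_and_select(decoder_net, finetune_step, content, keep_ratio=0.01, steps=100):
    """Overfit a copy of the network to one content item and sparsify the weight update."""
    base = {n: p.detach().clone() for n, p in decoder_net.named_parameters()}
    tuned = copy.deepcopy(decoder_net)
    opt = torch.optim.Adam(tuned.parameters(), lr=1e-4)
    for _ in range(steps):
        loss = finetune_step(tuned, content)   # e.g. a rate-distortion loss on this content
        opt.zero_grad()
        loss.backward()
        opt.step()

    # weight update to be compressed and signalled to the decoder
    delta = {n: p.detach() - base[n] for n, p in tuned.named_parameters()}

    # placeholder importance score: |delta|; keep only the top fraction of entries
    flat = torch.cat([d.abs().flatten() for d in delta.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = flat.topk(k).values.min()
    return {n: torch.where(d.abs() >= threshold, d, torch.zeros_like(d))
            for n, d in delta.items()}

def apply_update(decoder_net, sparse_delta):
    """Decoder side: add the received sparse update to its own weights before decoding."""
    with torch.no_grad():
        for n, p in decoder_net.named_parameters():
            if n in sparse_delta:
                p.add_(sparse_delta[n])
```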
{"title":"Learn to overfit better: finding the important parameters for learned image compression","authors":"Honglei Zhang, Francesco Cricri, H. R. Tavakoli, M. Santamaría, Y. Lam, M. Hannuksela","doi":"10.1109/VCIP53242.2021.9675360","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675360","url":null,"abstract":"For most machine learning systems, overfitting is an undesired behavior. However, overfitting a model to a test image or a video at inference time is a favorable and effective technique to improve the coding efficiency of learning-based image and video codecs. At the encoding stage, one or more neural networks that are part of the codec are finetuned using the input image or video to achieve a better coding performance. The encoder en-codes the input content into a content bitstream. If the finetuned neural network is part (also) of the decoder, the encoder signals the weight update of the finetuned model to the decoder along with the content bitstream. At the decoding stage, the decoder first updates its neural network model according to the received weight update, and then proceeds with decoding the content bitstream. Since a neural network contains a large number of parameters, compressing the weight update is critical to reducing bitrate overhead. In this paper, we propose learning-based methods to find the important parameters to be overfitted, in terms of rate-distortion performance. Based on simple distribution models for variables in the weight update, we derive two objective functions. By optimizing the proposed objective functions, the importance scores of the parameters can be calculated and the important parameters can be determined. Our experiments on lossless image compression codec show that the proposed method significantly outperforms a prior-art method where overfitted parameters were selected based on heuristics. Furthermore, our technique improved the compression performance of the state-of-the-art lossless image compression codec by 0.1 bit per pixel.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"37 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130758778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675407
A Bottom-up Fast CU Partition Scoring Mechanism for AVS3
Shiyi Liu, Zhenyu Wang, Ke Qiu, Jiayu Yang, Ronggang Wang
The third generation of the Audio Video Coding Standard (AVS3) achieves a 22% coding performance improvement over High Efficiency Video Coding (HEVC). However, the improvement in coding efficiency comes from a more flexible block partition scheme, at the cost of much higher encoding complexity. This paper proposes a bottom-up fast algorithm to prune the time-consuming search of the CU partition tree. Specifically, we design a scoring mechanism based on splitting patterns traced back from the bottom to predict how likely a partition type is to be selected as optimal. The score threshold for skipping the exhaustive Rate-Distortion Optimization (RDO) procedure of a partition type is determined by statistical analysis. Experimental results show that the proposed method achieves 24.56% time saving with 0.37% BDBR loss under the Random Access configuration and 12.50% complexity reduction with 0.08% BDBR loss under the All Intra configuration. Owing to its effectiveness, the method was adopted into the AVS3 open-source platform after evaluation by the AVS working group.
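An illustrative sketch of the bottom-up pruning logic described above: each candidate partition type of the current CU receives a score derived from the split decisions already made for its sub-blocks, and its full RDO search is skipped when the score falls below a threshold. The scoring rule, threshold value, and partition names here are placeholders, not the values adopted in the AVS3 reference software.

```python
SKIP_THRESHOLD = 2.0  # the paper tunes this by statistical analysis; this value is a stand-in

def score_partition(partition_type, child_decisions):
    """Accumulate evidence for a partition type from bottom-up child splitting patterns."""
    score = 0.0
    for child in child_decisions:
        # reward partition types consistent with how the children chose to split
        if child["best_split"] == partition_type:
            score += 1.0
        elif child["best_split"] == "no_split":
            score += 0.5
    return score

def search_cu(cu, candidate_partitions, child_decisions, run_rdo):
    """Search the best partition type, pruning low-scoring candidates before RDO."""
    best_type, best_cost = "no_split", run_rdo(cu, "no_split")
    for ptype in candidate_partitions:
        if score_partition(ptype, child_decisions) < SKIP_THRESHOLD:
            continue                      # prune: skip the expensive RDO for this type
        cost = run_rdo(cu, ptype)
        if cost < best_cost:
            best_type, best_cost = ptype, cost
    return best_type, best_cost
```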
{"title":"A Bottom-up Fast CU Partition Scoring Mechanism for AVS3","authors":"Shiyi Liu, Zhenyu Wang, Ke Qiu, Jiayu Yang, Ronggang Wang","doi":"10.1109/VCIP53242.2021.9675407","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675407","url":null,"abstract":"The third generation of Audio Video Coding Standard (AVS3) achieves 22% coding performance improvement compared with High Efficiency Video Coding (HEVC). However, the improvement of encoding efficiency comes from a more flexible block partition scheme is at the cost of much higher encoding complexity. This paper proposes a bottom-up fast algorithm to prune the time-consuming search process of the CU partition tree. To be specific, we design a scoring mechanism based on the splitting patterns traced back from the bottom to predict the possibility of a partition type to be selected as optimal. The score threshold to skip the exhaustive Rate-Distortion Optimization (RDO) procedure of the partition type is determined by statistical analysis. The experimental results show that the proposed methods can achieve 24.56% time-saving with 0.37% BDBR loss under Random Access configuration and 12.50% complexity reduction with 0.08% BDBR loss under All Intra configuration. The effectiveness leads to the adoption by the open-source platform of AVS3 after evaluated by the AVS working group.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129570534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675422
Neural Network based Inter bi-prediction Blending
Franck Galpin, P. Bordes, Thierry Dumas, Pavel Nikitin, F. L. Léannec
This paper presents a learning-based method to improve bi-prediction in video coding. In conventional video coding solutions, motion compensation of blocks from already decoded reference pictures is the principal tool used to predict the current frame. In particular, bi-prediction, in which a block is obtained by averaging two different motion-compensated prediction blocks, significantly improves the final temporal prediction accuracy. In this context, we introduce a simple neural network that further improves the blending operation. A complexity balance, both in terms of network size and encoder mode selection, is carried out. Extensive tests on top of the recently standardized VVC codec show a BD-rate improvement of −1.4% in the random access configuration with a network of fewer than 10k parameters. We also propose a simple CPU-based implementation and direct network quantization to assess the complexity/gains tradeoff in a conventional codec framework.
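A minimal PyTorch sketch of the blending idea above: instead of only averaging the two motion-compensated predictions P0 and P1, a very small network predicts a correction on top of the average. The layer sizes are chosen only to stay in the "fewer than 10k parameters" regime mentioned in the abstract; the actual architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class BiPredictionBlender(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, p0, p1):
        avg = 0.5 * (p0 + p1)                      # conventional bi-prediction
        refine = self.net(torch.cat([p0, p1], dim=1))
        return avg + refine                        # learned blending correction

blender = BiPredictionBlender()
print(sum(p.numel() for p in blender.parameters()))   # stays well under 10k parameters
p0, p1 = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
blended = blender(p0, p1)
```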
{"title":"Neural Network based Inter bi-prediction Blending","authors":"Franck Galpin, P. Bordes, Thierry Dumas, Pavel Nikitin, F. L. Léannec","doi":"10.1109/VCIP53242.2021.9675422","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675422","url":null,"abstract":"This paper presents a learning-based method to improve bi-prediction in video coding. In conventional video coding solutions, the motion compensation of blocks from already decoded reference pictures stands out as the principal tool used to predict the current frame. Especially, the bi-prediction, in which a block is obtained by averaging two different motion-compensated prediction blocks, significantly improves the final temporal prediction accuracy. In this context, we introduce a simple neural network that further improves the blending operation. A complexity balance, both in terms of network size and encoder mode selection, is carried out. Extensive tests on top of the recently standardized VVC codec are performed and show a BD-rate improvement of −1.4% in random access configuration for a network size of fewer than 10k parameters. We also propose a simple CPU-based implementation and direct network quantization to assess the complexity/gains tradeoff in a conventional codec framework.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123610571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675451
People Detection and Tracking Using a Fisheye Camera Network
T. Wang, Chih-Hao Liao, Li-Hsuan Hsieh, A. W. Tsui, Hsin-Chien Huang
In this paper, we study techniques for accurate detection, localization, and tracking of multiple people in an indoor scene covered by multiple top-view fisheye cameras, a rarely studied setting within the topic of multi-camera object tracking. Experimental results on test videos show good performance for practical use. We also propose methods to account for occlusion by scene objects at different stages of the algorithm, which lead to improved results.
{"title":"People Detection and Tracking Using a Fisheye Camera Network","authors":"T. Wang, Chih-Hao Liao, Li-Hsuan Hsieh, A. W. Tsui, Hsin-Chien Huang","doi":"10.1109/VCIP53242.2021.9675451","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675451","url":null,"abstract":"In this paper we study techniques for accurate detection, localization, and tracking of multiple people in an indoor scene covered by multiple top-view fisheye cameras. This is a rarely studied setting within the topic of multi-camera object tracking. The experimental results on test videos exhibit good performance for practical use. We also propose methods to account for occlusion by scene objects at different stages of the algorithm that lead to improved results.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123767854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675339
AutoDerain: Memory-efficient Neural Architecture Search for Image Deraining
Jun Fu, Chen Hou, Zhibo Chen
Learning-based image deraining methods have achieved remarkable success in the past few decades. Currently, most deraining architectures are designed by human experts, which is a laborious and error-prone process. In this paper, we present a study on employing neural architecture search (NAS) to automatically design deraining architectures, dubbed AutoDerain. Specifically, we first propose a U-shaped deraining architecture that mainly consists of residual squeeze-and-excitation blocks (RSEBs). Then, we define a search space over the convolution types and the use of the squeeze-and-excitation block. Considering that differentiable architecture search is memory-intensive, we propose a memory-efficient differentiable architecture search scheme (MDARTS). In light of the success of training binary neural networks, MDARTS optimizes the architecture parameters through the proximal gradient, consuming only as much GPU memory as training a single deraining model. Experimental results demonstrate that the architecture designed by MDARTS is superior to manually designed derainers.
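A hedged sketch of a proximal-gradient update on architecture parameters in the spirit of the MDARTS scheme described above: only one binarized sub-architecture is active per forward pass, so memory stays comparable to training a single model, and the real-valued architecture parameters are updated by a gradient step followed by a proximal projection. The projection used here (clipping to [0, 1]) is only an illustrative stand-in for the paper's formulation.

```python
import torch

def proximal_update(alpha, grad, lr=0.01):
    """One proximal-gradient step on the real-valued architecture parameters."""
    with torch.no_grad():
        alpha -= lr * grad            # gradient step on the search objective
        alpha.clamp_(0.0, 1.0)        # proximal projection onto a feasible set (placeholder)
    return alpha

def binarize(alpha):
    """Keep one candidate operation per search position, as in binary networks,
    so only one sub-architecture is instantiated in GPU memory at a time."""
    idx = alpha.argmax(dim=-1, keepdim=True)
    return torch.zeros_like(alpha).scatter_(-1, idx, 1.0)

alpha = torch.rand(8, 3)                       # e.g. 8 search positions, 3 candidate ops each
mask = binarize(alpha)                         # one-hot choice used in the forward pass
alpha = proximal_update(alpha, grad=torch.randn_like(alpha))
```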
{"title":"AutoDerain: Memory-efficient Neural Architecture Search for Image Deraining","authors":"Jun Fu, Chen Hou, Zhibo Chen","doi":"10.1109/VCIP53242.2021.9675339","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675339","url":null,"abstract":"Learning-based image deraining methods have achieved remarkable success in the past few decades. Currently, most deraining architectures are developed by human experts, which is a laborious and error-prone process. In this paper, we present a study on employing neural architecture search (NAS) to automatically design deraining architectures, dubbed AutoDerain. Specifically, we first propose an U-shaped deraining architecture, which mainly consists of residual squeeze-and-excitation blocks (RSEBs). Then, we define a search space, where we search for the convolutional types and the use of the squeeze-and-excitation block. Considering that the differentiable architecture search is memory-intensive, we propose a memory-efficient differentiable architecture search scheme (MDARTS). In light of the success of training binary neural networks, MDARTS optimizes architecture parameters through the proximal gradient, which only consumes the same GPU memory as training a single deraining model. Experimental results demonstrate that the architecture designed by MDARTS is superior to manually designed derainers.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126701049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675372
NeCH: Neural Clothed Human Model
Sheng Liu, Liangchen Song, Yi Xu, Junsong Yuan
Existing human models, e.g., SMPL and STAR, represent the 3D geometry of a human body as a polygon mesh obtained by deforming a template mesh according to a set of shape and pose parameters. The appearance, however, is not directly modeled by most existing human models. We present a novel 3D human model that faithfully models both the 3D geometry and the appearance of a clothed human body with a continuous volumetric representation, i.e., volume densities and emitted colors at continuous 3D locations in the volume encompassing the human body. In contrast to the mesh-based representation, whose resolution is limited by the mesh's fixed number of polygons, our volumetric representation does not limit the resolution of our model. Moreover, our volumetric representation can be rendered via differentiable volume rendering, enabling us to train the model using only 2D images (without ground-truth 3D geometries of human bodies) by minimizing a loss function that measures the differences between rendered images and ground-truth images. In contrast, existing human models are trained using ground-truth 3D geometries of human bodies. Thanks to its ability to jointly model both the geometry and the appearance of clothed people, our model can benefit applications including human image synthesis, gaming, 3D television, and telepresence.
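A small sketch of differentiable volume rendering, the mechanism that lets a continuous volumetric model like the one above be supervised with 2D images only: densities and colors sampled along each camera ray are alpha-composited into a pixel color, and a photometric loss against the ground-truth image is backpropagated through the compositing. The ray sampling strategy and the underlying network are omitted; this is generic volume rendering, not the authors' exact formulation.

```python
import torch

def composite_ray(densities, colors, deltas):
    """densities: [N], colors: [N, 3], deltas: [N] distances between consecutive samples."""
    alphas = 1.0 - torch.exp(-densities * deltas)
    # accumulated transmittance up to (but excluding) each sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = alphas * trans
    return (weights.unsqueeze(-1) * colors).sum(dim=0)   # rendered pixel color

# photometric supervision against a ground-truth pixel value
densities = torch.rand(64, requires_grad=True)
colors = torch.rand(64, 3, requires_grad=True)
deltas = torch.full((64,), 0.01)
pixel = composite_ray(densities, colors, deltas)
loss = ((pixel - torch.tensor([0.5, 0.4, 0.3])) ** 2).mean()
loss.backward()   # gradients flow back to the sampled densities and colors
```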
{"title":"NeCH: Neural Clothed Human Model","authors":"Sheng Liu, Liangchen Song, Yi Xu, Junsong Yuan","doi":"10.1109/VCIP53242.2021.9675372","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675372","url":null,"abstract":"Existing human models, e.g., SMPL and STAR, represent 3D geometry of a human body in the form of a polygon mesh obtained by deforming a template mesh according to a set of shape and pose parameters. The appearance, however, is not directly modeled by most existing human models. We present a novel 3D human model that faithfully models both the 3D geometry and the appearance of a clothed human body with a continuous volumetric representation, i.e., volume densities and emitted colors of continuous 3D locations in the volume encompassing the human body. In contrast to the mesh-based representation whose resolution is limited by a mesh's fixed number of polygons, our volumetric representation does not limit the resolution of our model. Moreover, our volumetric represen-tation can be rendered via differentiable volume rendering, thus enabling us to train the model only using 2D images (without using ground truth 3D geometries of human bodies) by minimizing a loss function which measures the differences between rendered images and ground truth images. On the contrary, existing human models are trained using ground truth 3D geometries of human bodies. Thanks to the ability of our model to jointly model both the geometries and the appearances of clothed people, our model can benefit applications including human image synthesis, gaming and 3D television and telepresence.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126959061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675393
Deep Metric Learning for Human Action Recognition with SlowFast Networks
Shan-zhi Shi, Cheolkon Jung
In this paper, we propose deep metric learning for human action recognition with SlowFast networks. We adopt SlowFast networks to extract slowly changing spatial semantic information about the target entity in the spatial domain together with fast-changing motion information in the temporal domain. Since deep metric learning can learn the class differences between human actions, we use it to learn a mapping from the original video to compact features in an embedding space. The proposed network consists of three main parts: 1) two branches operating independently at low and high frame rates to extract spatial and temporal features; 2) feature fusion of the two branches; and 3) a joint training network combining deep metric learning and a classification loss. Experimental results on the KTH human action dataset demonstrate that the proposed method achieves faster runtime and a smaller model size than C3D and R3D while maintaining high accuracy.
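A hedged sketch of the joint objective described above: video clips pass through a (SlowFast-style) backbone, and the resulting embedding is trained with both a classification loss and a metric-learning (triplet) loss so that actions of the same class cluster in the embedding space. The stand-in backbone, feature dimensions, batch layout, and loss weighting are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class ActionEmbedder(nn.Module):
    def __init__(self, backbone, feat_dim=256, embed_dim=128, num_classes=6):
        super().__init__()
        self.backbone = backbone                    # stand-in for a two-pathway SlowFast network
        self.embed = nn.Linear(feat_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, clips):
        feat = self.backbone(clips)
        emb = nn.functional.normalize(self.embed(feat), dim=-1)
        return emb, self.classifier(emb)

def joint_loss(emb, logits, labels, margin=0.3, w_metric=1.0):
    # batch is assumed to be arranged as consecutive (anchor, positive, negative) triplets
    cls = nn.functional.cross_entropy(logits, labels)
    metric = nn.TripletMarginLoss(margin=margin)(emb[0::3], emb[1::3], emb[2::3])
    return cls + w_metric * metric

# toy usage with a stand-in backbone producing 256-D clip features
backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
model = ActionEmbedder(backbone)
clips = torch.randn(6, 3, 8, 32, 32)                # two (anchor, positive, negative) triplets
emb, logits = model(clips)
loss = joint_loss(emb, logits, labels=torch.randint(0, 6, (6,)))
```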
{"title":"Deep Metric Learning for Human Action Recognition with SlowFast Networks","authors":"Shan-zhi Shi, Cheolkon Jung","doi":"10.1109/VCIP53242.2021.9675393","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675393","url":null,"abstract":"In this paper, we propose deep metric learning for human action recognition with SlowFast networks. We adopt SlowFast Networks to extract slow-changing spatial semantic information of a single target entity in the spatial domain with fast-changing motion information in the temporal domain. Since deep metric learning is able to learn the class difference between human actions, we utilize deep metric learning to learn a mapping from the original video to the compact features in the embedding space. The proposed network consists of three main parts: 1) two branches independently operating at low and high frame rates to extract spatial and temporal features; 2) feature fusion of the two branches; 3) joint training network of deep metric learning and classification loss. Experimental results on the KTH human action dataset demonstrate that the proposed method achieves faster runtime with less model size than C3D and R3D, while ensuring high accuracy.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114432473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-05  DOI: 10.1109/VCIP53242.2021.9675314
Large-Scale Crowdsourcing Subjective Quality Evaluation of Learning-Based Image Coding
Evgeniy Upenik, Michela Testolina, J. Ascenso, Fernando Pereira, T. Ebrahimi
Learning-based image codecs produce compression artifacts that differ from the blocking and blurring degradations introduced by conventional image codecs such as JPEG, JPEG 2000, and HEIC. In this paper, a crowdsourcing-based subjective quality evaluation procedure was used to benchmark a representative set of end-to-end deep learning-based image codecs submitted to the MMSP'2020 Grand Challenge on Learning-Based Image Coding and the JPEG AI Call for Evidence. For the first time, a double-stimulus methodology with a continuous quality scale was applied to evaluate this type of image codec. The subjective experiment is one of the largest ever reported, including more than 240 pair comparisons evaluated by 118 naïve subjects. The results of benchmarking learning-based image coding solutions against conventional codecs are organized into a dataset of differential mean opinion scores, which is made publicly available along with the stimuli.
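A small sketch of how differential mean opinion scores (DMOS) of the kind released with such a dataset are commonly computed: each subject's rating of a processed stimulus is differenced against the same subject's rating of the corresponding reference, then averaged over subjects. The exact subject screening and normalization used in the paper are not reproduced here.

```python
import numpy as np

def dmos(ratings_test, ratings_ref):
    """ratings_*: per-subject scores for the test and reference stimuli on the same continuous scale."""
    ratings_test = np.asarray(ratings_test, dtype=float)
    ratings_ref = np.asarray(ratings_ref, dtype=float)
    diff = ratings_ref - ratings_test                      # per-subject differential score
    return diff.mean(), diff.std(ddof=1) / np.sqrt(len(diff))   # DMOS and its standard error

mean_dmos, std_err = dmos([62, 70, 55, 64], [85, 90, 80, 88])
```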
{"title":"Large-Scale Crowdsourcing Subjective Quality Evaluation of Learning-Based Image Coding","authors":"Evgeniy Upenik, Michela Testolina, J. Ascenso, Fernando Pereira, T. Ebrahimi","doi":"10.1109/VCIP53242.2021.9675314","DOIUrl":"https://doi.org/10.1109/VCIP53242.2021.9675314","url":null,"abstract":"Learning-based image codecs produce different compression artifacts, when compared to the blocking and blurring degradation introduced by conventional image codecs, such as JPEG, JPEG 2000 and HEIC. In this paper, a crowdsourcing based subjective quality evaluation procedure was used to benchmark a representative set of end-to-end deep learning-based image codecs submitted to the MMSP'2020 Grand Challenge on Learning-Based Image Coding and the JPEG AI Call for Evidence. For the first time, a double stimulus methodology with a continuous quality scale was applied to evaluate this type of image codecs. The subjective experiment is one of the largest ever reported including more than 240 pair-comparisons evaluated by 118 naïve subjects. The results of the benchmarking of learning-based image coding solutions against conventional codecs are organized in a dataset of differential mean opinion scores along with the stimuli and made publicly available.","PeriodicalId":114062,"journal":{"name":"2021 International Conference on Visual Communications and Image Processing (VCIP)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114636275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}