
Latest publications: 2021 International Conference on Visual Communications and Image Processing (VCIP)

A Video Dataset for Learning-based Visual Data Compression and Analysis
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675343
Xiaozhong Xu, Shan Liu, Zeqi Li
Learning-based visual data compression and analysis have attracted great interest from both academia and industry recently. More training and testing datasets, especially good-quality video datasets, are highly desirable for related research and standardization activities. A UHD video dataset, referred to as the Tencent Video Dataset (TVD), is established to serve various purposes such as training neural network-based coding tools and testing machine vision tasks including object detection and segmentation. This dataset contains 86 video sequences covering a variety of content. Each video sequence consists of 65 frames at 4K (3840x2160) spatial resolution. In this paper, the details of this dataset, as well as its performance when compressed by the VVC and HEVC video codecs, are introduced.
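For experiments with a dataset of this kind, the 4K sequences are typically handled as raw planar YUV 4:2:0 files. The following minimal sketch reads a single frame; the bit depth, chroma format, and file layout are assumptions and should be checked against the dataset's own documentation.

```python
import numpy as np

def read_yuv420_frame(f, width=3840, height=2160, bit_depth=10):
    """Read one raw planar YUV 4:2:0 frame from an open file object.

    Assumes little-endian 16-bit containers for bit depths above 8;
    verify against the dataset specification before use.
    """
    dtype = np.uint16 if bit_depth > 8 else np.uint8
    y = np.fromfile(f, dtype=dtype, count=width * height).reshape(height, width)
    u = np.fromfile(f, dtype=dtype, count=(width // 2) * (height // 2)).reshape(height // 2, width // 2)
    v = np.fromfile(f, dtype=dtype, count=(width // 2) * (height // 2)).reshape(height // 2, width // 2)
    return y, u, v
```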
Citations: 9
Video Coding Pre-Processing Based on Rate-Distortion Optimized Weighted Guided Filter
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675444
Xi Huang, Luheng Jia, Han Wang, Ke-bin Jia
In video coding, compressing high-frequency components, including noise and visually imperceptible content, is an intractable problem: they consume a large amount of bandwidth while providing limited quality improvement. Directly applying denoising methods degrades coding performance and is therefore not suitable for the video coding scenario. In this work, we propose a video pre-processing approach that leverages an edge-preserving filter specifically designed for video coding, whose filter parameters are optimized for rate-distortion (R-D) performance. The proposed pre-processing method removes components with low R-D cost-effectiveness for the video encoder while keeping important structural components, leading to higher coding efficiency as well as better subjective quality. Compared with conventional denoising filters, our pre-processing method using the R-D optimized edge-preserving filter improves coding efficiency by up to −5.2% BD-rate with low computational complexity.
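As an illustration of the idea (not the authors' implementation), the sketch below applies a self-guided guided filter at several strengths and keeps the result minimizing a toy rate-distortion cost, with deviation from the original as distortion and residual high-frequency energy as a crude rate proxy; the candidate strengths and the weighting factor are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(img, radius, eps):
    """Self-guided guided filter (He et al.) applied to a grayscale frame."""
    img = img.astype(np.float64)
    size = 2 * radius + 1
    mean_i = uniform_filter(img, size)
    var_i = uniform_filter(img * img, size) - mean_i ** 2
    a = var_i / (var_i + eps)              # edge-preserving gain
    b = (1.0 - a) * mean_i
    return uniform_filter(a, size) * img + uniform_filter(b, size)

def rd_optimized_prefilter(frame, strengths=(25.0, 100.0, 400.0), radius=8, lam=0.05):
    """Pick the filter strength minimizing distortion + lam * rate proxy."""
    ref = frame.astype(np.float64)
    best, best_cost = ref, float("inf")
    for eps in strengths:
        smoothed = guided_filter(ref, radius, eps)
        distortion = np.mean((smoothed - ref) ** 2)                          # fidelity to the original
        rate_proxy = np.mean((smoothed - uniform_filter(smoothed, 2 * radius + 1)) ** 2)  # leftover detail
        cost = distortion + lam * rate_proxy
        if cost < best_cost:
            best, best_cost = smoothed, cost
    return best
```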
Citations: 0
Parallelized Context Modeling for Faster Image Coding
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675377
A. B. Koyuncu, Kai Cui, A. Boev, E. Steinbach
Learning-based image compression has reached the performance of classical methods such as BPG. One common approach is to use an autoencoder network to map the pixel information to a latent space and then approximate the symbol probabilities in that space with a context model. During inference, the learned context model provides symbol probabilities, which are used by the entropy encoder to obtain the bitstream. Currently, the most effective context models use autoregression, but autoregression results in very high decoding complexity due to the serialized data processing. In this work, we propose a method to parallelize the autoregressive process used for image compression. In our experiments, we achieve a decoding speed that is over 8 times faster than the standard autoregressive context model, with almost no reduction in compression performance.
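One common way to parallelize such a context model (the paper's own grouping may differ) is to split latent positions into groups that are decoded in waves, e.g. a checkerboard pattern in which all anchor positions are decoded in parallel first and the remaining positions are then decoded in parallel conditioned on the anchors. A minimal sketch of building such masks:

```python
import numpy as np

def checkerboard_masks(height, width):
    """Return boolean masks for a two-pass decoding schedule:
    anchors first, then non-anchors conditioned on the decoded anchors."""
    yy, xx = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    anchors = (yy + xx) % 2 == 0
    return anchors, ~anchors
```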
Citations: 2
Cross-Component Sample Offset for Image and Video Coding
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675355
Yixin Du, Xin Zhao, Shanchun Liu
Existing cross-component video coding technologies have shown great potential in improving coding efficiency. The fundamental insight of cross-component coding technology is to exploit the statistical correlations among different color components. In this paper, a Cross-Component Sample Offset (CCSO) approach for image and video coding is proposed, inspired by the observation that the luma component tends to contain more texture, while the chroma components are relatively smoother. The key component of CCSO is a non-linear offset mapping mechanism implemented as a look-up table (LUT). The input of the mapping is the co-located reconstructed samples of the luma component, and the output is the offset values applied to the chroma component. The proposed method has been implemented on top of a recent version of libaom. Experimental results show that the proposed approach brings a 1.16% Random Access (RA) BD-rate saving on top of AV1 with a marginal increase in encoding/decoding time.
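A simplified sketch of the LUT idea described above: co-located reconstructed luma samples are classified into a small number of bands, and the band index selects an offset that is added to the chroma reconstruction. The 4:2:0 co-location, the band quantizer, and the 8-bit clipping range are illustrative assumptions rather than the normative CCSO design.

```python
import numpy as np

def apply_ccso_like_offset(luma_rec, chroma_rec, lut, band_shift=5):
    """Add a LUT-selected offset to chroma based on co-located luma samples.

    luma_rec:   (2H, 2W) reconstructed luma plane (4:2:0 assumed)
    chroma_rec: (H, W)   reconstructed chroma plane
    lut:        1-D sequence of integer offsets, indexed by luma band
    """
    luma_co = luma_rec[::2, ::2]                      # crude 4:2:0 co-location
    bands = luma_co.astype(np.int32) >> band_shift    # classify luma into intensity bands
    bands = np.clip(bands, 0, len(lut) - 1)
    offsets = np.asarray(lut)[bands]
    out = chroma_rec.astype(np.int32) + offsets
    return np.clip(out, 0, 255).astype(chroma_rec.dtype)
```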
Citations: 1
A Multi-dimensional Aesthetic Quality Assessment Model for Mobile Game Images
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675430
Tao Wang, Wei Sun, Xiongkuo Min, Wei Lu, Zicheng Zhang, Guangtao Zhai
With the development of the game industry and the popularization of mobile devices, mobile games have come to play an important role in people's entertainment life. The aesthetic quality of mobile game images determines users' Quality of Experience (QoE) to a certain extent. In this paper, we propose a multi-task deep learning based method to evaluate the aesthetic quality of mobile game images in multiple dimensions (i.e., fineness, color harmony, colorfulness, and overall quality). Specifically, we first extract the quality-aware feature representation by integrating the features from all intermediate layers of the convolutional neural network (CNN) and then map these quality-aware features into the quality score space of each dimension via the quality regressor module, which consists of three fully connected (FC) layers. The proposed model is trained in a multi-task learning manner, where the quality-aware features are shared by the different quality dimension prediction tasks, and the multi-dimensional quality scores of each image are regressed by multiple quality regression modules respectively. We further introduce an uncertainty principle to balance the loss of each task in the training stage. The experimental results show that our proposed model achieves the best performance among state-of-the-art image quality assessment (IQA) and aesthetic quality assessment (AQA) algorithms on the Multi-dimensional Aesthetic assessment for Mobile Game image database (MAMG).
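The uncertainty-based balancing mentioned above is commonly implemented with learnable homoscedastic-uncertainty weights (Kendall et al.); the sketch below shows that standard formulation for the four quality dimensions, as an assumption about the form of the loss rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedMSE(nn.Module):
    """Balance per-dimension regression losses with learnable log-variances."""
    def __init__(self, num_tasks=4):  # fineness, color harmony, colorfulness, overall
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, preds, targets):
        # preds, targets: (batch, num_tasks) predicted and ground-truth scores
        per_task = ((preds - targets) ** 2).mean(dim=0)
        # Tasks with high learned uncertainty are down-weighted; the log term
        # keeps the uncertainties from growing without bound.
        return (torch.exp(-self.log_vars) * per_task + self.log_vars).sum()
```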
Citations: 11
Deep Learning-Based Blind Image Super-Resolution using Iterative Networks
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675367
Asfand Yaar, H. Ateş, B. Gunturk
Deep learning-based single image super-resolution (SR) consistently shows superior performance compared to the traditional SR methods. However, most of these methods assume that the blur kernel used to generate the low-resolution (LR) image is known and fixed (e.g. bicubic). Since blur kernels involved in real-life scenarios are complex and unknown, per-formance of these SR methods is greatly reduced for real blurry images. Reconstruction of high-resolution (HR) images from randomly blurred and noisy LR images remains a challenging task. Typical blind SR approaches involve two sequential stages: i) kernel estimation; ii) SR image reconstruction based on estimated kernel. However, due to the ill-posed nature of this problem, an iterative refinement could be beneficial for both kernel and SR image estimate. With this observation, in this paper, we propose an image SR method based on deep learning with iterative kernel estimation and image reconstruction. Simulation results show that the proposed method outperforms state-of-the-art in blind image SR and produces visually superior results as well.
Citations: 2
Analysis of VVC Intra Prediction Block Partitioning Structure
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675347
Mário Saldanha, G. Sanchez, C. Marcon, L. Agostini
This paper presents an encoding time and encoding efficiency analysis of the Quadtree with nested Multi-type Tree (QTMT) structure in the Versatile Video Coding (VVC) intra-frame prediction. The QTMT structure enables VVC to improve the compression performance compared to its predecessor standard at the cost of a higher encoding complexity. The intra-frame prediction time raised about 26 times compared to the HEVC reference software, and most of this time is related to the new block partitioning structure. Thus, this paper provides a detailed description of the VVC block partitioning structure and an in-depth analysis of the QTMT structure regarding coding time and coding efficiency. Based on the presented analyses, this paper can guide outcoming works focusing on the block partitioning of the VVC intra-frame prediction.
Citations: 1
MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675348
Cong Zou, Xuchen Wang, Yaosi Hu, Zhenzhong Chen, Shan Liu
Video captioning is considered challenging due to the combination of video understanding and text generation. Recent progress in video captioning has been made mainly using methods of visual feature extraction and sequential learning. However, the syntax structure and semantic consistency of generated captions are not fully explored. Thus, in our work, we propose a novel multimodal attention based framework with Part-of-Speech (POS) sequence guidance to generate more accurate video captions. In general, the word sequence generation and POS sequence prediction are hierarchically and jointly modeled in the framework. Specifically, different modalities including visual, motion, object and syntactic features are adaptively weighted and fused with the POS guided attention mechanism when computing the probability distributions of predicted words. Experimental results on two benchmark datasets, i.e., MSVD and MSR-VTT, demonstrate that the proposed method can not only fully exploit the information from video and text content, but also focus on the decisive feature modality when generating a word with a certain POS type. Thus, our approach boosts video captioning performance as well as generating idiomatic captions.
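A generic sketch of adaptively weighting and fusing per-modality features under guidance of the decoder state; the class name, dimensions, and scoring function are assumptions, and the paper's POS-guided attention is more elaborate than this.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Attention over modalities (e.g. visual, motion, object, syntactic)."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, modal_feats, hidden):
        # modal_feats: (batch, num_modalities, feat_dim); hidden: (batch, hidden_dim)
        h = hidden.unsqueeze(1).expand(-1, modal_feats.size(1), -1)
        scores = self.score(torch.cat([modal_feats, h], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)                 # attention over modalities
        fused = (weights.unsqueeze(-1) * modal_feats).sum(dim=1)  # weighted sum of modality features
        return fused, weights
```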
Citations: 1
Deformable Convolution Based No-Reference Stereoscopic Image Quality Assessment Considering Visual Feedback Mechanism
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675324
Mingyue Zhou, Sumei Li
Simulation of human visual system (HVS) is very crucial for fitting human perception and improving assessment performance in stereoscopic image quality assessment (SIQA). In this paper, a no-reference SIQA method considering feedback mechanism and orientation selectivity of HVS is proposed. In HVS, feedback connections are indispensable during the process of human perception, which has not been studied in the existing SIQA models. Therefore, we design a new feedback module (FBM) to realize the guidance of the high-level region of visual cortex to the low-level region. In addition, given the orientation selectivity of primary visual cortex cells, a deformable feature extraction block is explored to simulate it, and the block can adaptively select the regions of interest. Meanwhile, retinal ganglion cells (RGCs) with different receptive fields have different sensitivities to objects of different sizes in the image. So a new multi receptive fields information extraction and fusion manner is realized in the network structure. Experimental results show that the proposed model is superior to the state-of-the-art no-reference SIQA methods and has excellent generalization ability.
Citations: 2
Generative DNA: Representation Learning for DNA-based Approximate Image Storage
Pub Date : 2021-12-05 DOI: 10.1109/VCIP53242.2021.9675366
Giulio Franzese, Yiqing Yan, G. Serra, Ivan D'Onofrio, Raja Appuswamy, P. Michiardi
Synthetic DNA has received much attention recently as a long-term archival medium alternative due to its high density and durability characteristics. However, most current work has primarily focused on using DNA as a precise storage medium. In this work, we take an alternate view of DNA. Using neural-network-based compression techniques, we transform images into a latent-space representation, which we then store on DNA. By doing so, we transform DNA into an approximate image storage medium, as images generated back from DNA are only approximate representations of the original images. Using several datasets, we investigate the storage benefits of approximation, and study the impact of DNA storage errors (substitutions, indels, bias) on the quality of approximation. In doing so, we demonstrate the feasibility and potential of viewing DNA as an approximate storage medium.
Citations: 3