IEEE Journal on Emerging and Selected Topics in Circuits and Systems: Latest Articles (Page 6)

FVIFormer: Flow-Guided Global-Local Aggregation Transformer Network for Video Inpainting

IF 3.7, CAS Zone 2 (Engineering & Technology), Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date: 2024-04-25 DOI: 10.1109/JETCAS.2024.3392972

Weiqing Yan; Yiqiu Sun; Guanghui Yue; Wei Zhou; Hantao Liu

Video inpainting has been widely used in recent years. Existing methods usually exploit the similarity between the missing region and its surrounding features to fill in visually damaged content in a multi-stage manner. However, because video content is complex, this can destroy the structural information of objects in the video. Moving objects inside the damaged regions make the task even harder. To address these issues, we propose a flow-guided global-local aggregation Transformer network for video inpainting. First, we use a pre-trained optical flow completion network to repair the corrupted optical flow of the video frames. Then, we propose a content inpainting module. It uses the completed optical flow as guidance and propagates global content across frames with efficient temporal and spatial Transformers to fill in the corrupted regions. Finally, we propose a structural rectification module that improves the coherence of content around the missing regions by combining the extracted local and global features. In addition, for the sake of overall efficiency, we optimize the self-attention mechanism with depth-wise separable encoding to speed up training and testing. We validate our method on the YouTube-VOS and DAVIS video datasets. Extensive experimental results demonstrate its effectiveness in completing the edges of video content that has undergone stabilisation algorithms.


Vol. 14, No. 2, pp. 235-244

Citations: 0
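The FVIFormer abstract describes using completed optical flow as a guide for propagating content across frames. The sketch below illustrates only that basic idea and is not the authors' Transformer-based module. It fills each masked pixel of a target frame by following that pixel's flow vector into a neighboring frame and copying the value there, provided the landing pixel is valid. The grid layout, the (dx, dy) flow convention, the nearest-pixel rounding and the name propagate_from_neighbor are all assumptions made for illustration.

```python
from typing import List, Tuple

Frame = List[List[float]]
BoolGrid = List[List[bool]]
Flow = List[List[Tuple[float, float]]]  # per-pixel (dx, dy) from target to neighbor


def propagate_from_neighbor(target: Frame, missing: BoolGrid, neighbor: Frame,
                            neighbor_valid: BoolGrid, flow: Flow) -> Tuple[Frame, BoolGrid]:
    """Fill missing target pixels by copying flow-aligned pixels from a neighbor frame.

    Hypothetical helper: nearest-pixel lookup, no blending or occlusion handling.
    `missing` is True where the target is corrupted; `neighbor_valid` is True where
    the neighbor frame holds usable content. Returns the filled frame and the mask
    of pixels that are still missing.
    """
    h, w = len(target), len(target[0])
    out = [row[:] for row in target]
    still_missing = [row[:] for row in missing]
    for y in range(h):
        for x in range(w):
            if not missing[y][x]:
                continue
            dx, dy = flow[y][x]
            nx, ny = int(round(x + dx)), int(round(y + dy))
            # Copy only if the flow lands inside the neighbor frame on a valid pixel.
            if 0 <= nx < w and 0 <= ny < h and neighbor_valid[ny][nx]:
                out[y][x] = neighbor[ny][nx]
                still_missing[y][x] = False
    return out, still_missing
```

Pixels that remain in still_missing would need another source, such as a further frame or a spatial inpainting step. The abstract assigns that role to its global content propagation and structural rectification modules.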

Enhancing Image Quality by Reducing Compression Artifacts Using Dynamic Window Swin Transformer

IF 3.7, CAS Zone 2 (Engineering & Technology), Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date: 2024-04-24 DOI: 10.1109/JETCAS.2024.3392868

Zhenchao Ma; Yixiao Wang; Hamid Reza Tohidypour; Panos Nasiopoulos; Victor C. M. Leung

Video/image compression codecs exploit the characteristics of the human visual system, including its varying sensitivity to certain frequencies, brightness, contrast and colors, to achieve high compression ratios. Compression inevitably introduces undesirable visual artifacts, and as compression standards improve, restoring image quality becomes more challenging. Recently, deep-learning-based models, especially transformer-based image restoration models, have emerged as a promising approach for reducing compression artifacts, with strong restoration performance. However, existing transformer-based restoration methods all use the same fixed window size, which confines pixel dependencies to fixed areas. In this paper, we propose an image restoration method that addresses this shortcoming. First, we introduce a content-adaptive dynamic window for the self-attention layers of the Swin Transformer. These layers are in turn weighted by our channel and spatial attention module, mainly to capture long- and medium-range pixel dependencies. In addition, we strengthen local dependencies by integrating a CNN-based network inside the Swin Transformer Block to process the image augmented by our self-attention module. We evaluate the approach on images compressed with one of the latest compression standards, Versatile Video Coding (VVC). On three benchmark datasets for VVC compression artifact reduction, it achieves an average Peak Signal-to-Noise Ratio (PSNR) gain of 1.32 dB. It also improves the visual quality of compressed images by an average of 2.7% in terms of Video Multimethod Assessment Fusion (VMAF).


Vol. 14, No. 2, pp. 275-285

Citations: 0
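The 1.32 dB figure is easier to interpret as a change in mean squared error. Under the standard definition PSNR = 10 log10(MAX^2 / MSE), a PSNR gain of G dB at fixed MAX means the MSE is multiplied by 10^(-G/10). The check below assumes 8-bit images (MAX = 255). It is a back-of-the-envelope conversion and not part of the authors' method.

```python
import math


def psnr(mse: float, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE must shrink to raise PSNR by gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(1.32)
print(f"MSE ratio for +1.32 dB: {ratio:.4f} (about {100 * (1 - ratio):.1f}% lower MSE)")
```

A 1.32 dB average gain therefore corresponds to roughly a 26% lower MSE, whatever the absolute error level. The exact figure depends on how the average is taken across images and datasets.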

Low Latency Variational Autoencoder on FPGAs

IF 3.7, CAS Zone 2 (Engineering & Technology), Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date: 2024-04-16 DOI: 10.1109/JETCAS.2024.3389660

Zhiqiang Que; Minghao Zhang; Hongxiang Fan; He Li; Ce Guo; Wayne Luk

Variational Autoencoders (VAEs) are at the forefront of generative model research, combining probabilistic theory with neural networks to learn intricate data structures and synthesize complex data. However, VAE designs are computationally intensive and often incur high latency that precludes real-time operation. This paper introduces a novel low-latency hardware pipeline on FPGAs for fully stochastic VAE inference. We propose a custom Gaussian sampling layer and a layer-wise tailored pipeline architecture, which, for the first time in VAE acceleration, are optimized through High-Level Synthesis (HLS). Evaluation results show that our VAE design is 82 times faster than a CPU implementation and 208 times faster than a GPU implementation. Compared with a state-of-the-art FPGA-based autoencoder design for anomaly detection, our VAE design is 61 times faster at the same model accuracy. This shows that our approach enables high-performance, low-latency FPGA-based VAE systems.


Citations: 0
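A fully stochastic VAE draws a Gaussian sample at inference time, typically through the reparameterization z = mu + sigma * eps with eps ~ N(0, 1). The abstract does not say how the custom Gaussian sampling layer generates eps. The sketch below uses the Box-Muller transform as one common choice, written in plain Python rather than HLS C++. The function names and the log-variance parameterization are assumptions.

```python
import math
import random
from typing import List, Sequence


def box_muller(rng: random.Random) -> float:
    """Draw one standard-normal sample from two uniform samples (Box-Muller)."""
    u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is finite
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def gaussian_sampling_layer(mu: Sequence[float], log_var: Sequence[float],
                            rng: random.Random) -> List[float]:
    """Reparameterized sample z = mu + exp(0.5 * log_var) * eps, one eps per latent."""
    return [m + math.exp(0.5 * lv) * box_muller(rng) for m, lv in zip(mu, log_var)]


rng = random.Random(0)
z = gaussian_sampling_layer([0.5, -1.0, 2.0], [0.0, -2.0, 1.0], rng)
```

On an FPGA, the floating-point log, sqrt and cos might be replaced by fixed-point approximations or lookup tables, with a hardware pseudo-random generator supplying the uniforms. The abstract does not state which approach the paper takes.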

CGVC-T: Contextual Generative Video Compression With Transformers

IF 3.7, CAS Zone 2 (Engineering & Technology), Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date: 2024-04-10 DOI: 10.1109/JETCAS.2024.3387301

Pengli Du; Ying Liu; Nam Ling

Driven by the high demand for video streaming, recent years have seen growing interest in using deep learning for video compression. Most existing neural video compression approaches adopt the predictive residual coding framework, which is sub-optimal at removing redundancy across frames. In addition, purely minimizing the pixel-wise differences between the raw frame and the decompressed frame is ineffective at improving the perceptual quality of videos. In this paper, we propose a contextual generative video compression method with transformers (CGVC-T). It adopts generative adversarial networks (GANs) to enhance perceptual quality and applies contextual coding to improve coding efficiency. We also employ a hybrid transformer-convolution structure in the auto-encoders of CGVC-T, which learns both global and local features within video frames to remove temporal and spatial redundancy. Furthermore, we introduce novel entropy models to estimate the probability distributions of the compressed latent representations, thereby reducing the bit rate required to transmit the compressed video. Experiments on the HEVC, UVG and MCL-JCV datasets show that CGVC-T achieves better perceptual quality in terms of FID, KID and LPIPS scores than three groups of codecs. These are state-of-the-art learned video codecs, the industrial video codecs x264 and x265, and the official reference software JM, HM and VTM. CGVC-T also achieves the best DISTS scores among all compared learned video codecs.


Vol. 14, No. 2, pp. 209-223

Citations: 0
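Learned codecs usually convert an entropy model's probability estimate into a bit cost of -log2 p per quantized symbol. The abstract does not specify CGVC-T's entropy models. The sketch below therefore uses a discretized Gaussian, a common choice in learned compression, in which p(y_hat) is the Gaussian mass in [y_hat - 0.5, y_hat + 0.5]. In a contextual codec the mean and scale would be predicted from context such as previously decoded frames. Here they are simply passed in, and all names are hypothetical.

```python
import math
from typing import Sequence


def gaussian_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bits_for_latent(y_hat: float, mu: float, sigma: float, eps: float = 1e-9) -> float:
    """Estimated bits for one integer-quantized latent under a discretized Gaussian."""
    p = gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)
    return -math.log2(max(p, eps))  # eps guards against log(0) for far-off values


def total_bits(latents: Sequence[float], mus: Sequence[float],
               sigmas: Sequence[float]) -> float:
    """Sum of per-symbol bit estimates, i.e. the rate term of a rate-distortion loss."""
    return sum(bits_for_latent(y, m, s) for y, m, s in zip(latents, mus, sigmas))
```

A better context model predicts a mean closer to the actual latent and a smaller scale, and in this formula both changes lower the estimated bit count.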

Physically Guided Generative Adversarial Network for Holographic 3D Content Generation From Multi-View Light Field

IF 3.7, CAS Zone 2 (Engineering & Technology), Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date: 2024-04-09 DOI: 10.1109/JETCAS.2024.3386672

Yunhui Zeng; Zhenwei Long; Yawen Qiu; Shiyi Wang; Junjie Wei; Xin Jin; Hongkun Cao; Zhiheng Li

Realizing high-fidelity three-dimensional (3D) scene representation through holography is a formidable challenge. This is mainly because the mechanism behind the optimal hologram is unknown, and because the computational load and memory usage are heavy. Herein, we propose a Physically Guided Generative Adversarial Network (PGGAN), the first generative model to transform a multi-view light field directly into holographic 3D content. PGGAN combines the fidelity of data-driven learning with the rigor of physical optics principles. This ensures stable reconstruction quality across a wide field of view, which current central-view-centric approaches cannot achieve. The framework uses an encoder-generator-discriminator architecture informed by a physical optics model. It exploits the speed and adaptability of data-driven methods to learn quickly and transfer effectively to novel scenes. Meanwhile, its physics-based guidance ensures that the generated holograms adhere to holographic standards. A differentiable physical model enables end-to-end training and aligns the generative process with the “holographic space”, thereby improving the quality of the reconstructed light fields. With an adaptive loss strategy, PGGAN dynamically adjusts the influence of physical guidance in the initial training stages and later optimizes for reconstruction accuracy. Empirical evaluations show that PGGAN can generate a detailed hologram in as little as 0.002 seconds. This is considerably faster than current state-of-the-art techniques, and PGGAN also maintains superior angular reconstruction fidelity. These results demonstrate that PGGAN produces high-quality holograms rapidly from multi-view datasets, a significant step toward real-time holographic rendering.


Vol. 14, No. 2, pp. 286-298

Citations: 0
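The PGGAN abstract mentions a differentiable physical optics model used for end-to-end training but does not name it. One standard free-space propagation model in computer-generated holography is the angular spectrum method. It Fourier-transforms the field, multiplies by H(fx) = exp(i 2π z sqrt(1/λ² - fx²)), and transforms back. The 1D, pure-Python sketch below uses a naive DFT and drops evanescent components. It is only meant to show that operation. A training pipeline would use a 2D FFT inside an autodiff framework, and nothing here should be read as PGGAN's actual model.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform; the inverse includes the 1/n factor."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def angular_spectrum_1d(field: Sequence[complex], wavelength: float,
                        dx: float, z: float) -> List[complex]:
    """Propagate a sampled 1D complex field over distance z (all lengths in metres)."""
    n = len(field)
    spectrum = dft(field)
    for k in range(n):
        # Spatial frequency of bin k in the usual FFT ordering (non-negative bins first).
        fx = (k if k < (n + 1) // 2 else k - n) / (n * dx)
        arg = 1.0 / wavelength ** 2 - fx ** 2
        if arg > 0:
            spectrum[k] *= cmath.exp(2j * math.pi * z * math.sqrt(arg))
        else:
            spectrum[k] = 0j  # evanescent component: discarded in this sketch
    return dft(spectrum, inverse=True)
```

Every step is a linear operation on the field values, so a loss computed on the propagated field can in principle be backpropagated to the hologram. That property is what end-to-end training with a physical model relies on.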

Human–Machine Collaborative Image Compression Method Based on Implicit Neural Representations

IF 3.7, CAS Zone 2 (Engineering & Technology), Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date: 2024-04-09 DOI: 10.1109/JETCAS.2024.3386639

Huanyang Li; Xinfeng Zhang

With the explosive growth in the volume of images intended for AI analysis, image coding for machines has been proposed to transmit information in a machine-interpretable format, thereby improving compression efficiency. However, because of their high compression ratios, such schemes often lose image details and features and leave semantic information unclear, making them less suitable for human viewing. Balancing visual quality and machine vision accuracy at a given compression ratio is therefore a critical problem. To address these issues, we introduce a human-machine collaborative image coding framework based on Implicit Neural Representations (INR). It effectively reduces the information transmitted for machine vision tasks at the decoding side. At the same time, it maintains efficient image compression for human vision within the INR compression framework. To improve the model's perception of images for machine vision, we design a semantic embedding enhancement module that helps capture image semantics. Specifically, we use a Swin Transformer to initialize image features, ensuring that the embeddings of the compression model are effectively applicable to downstream vision tasks. Extensive experimental results demonstrate that our method significantly outperforms other image compression methods on classification tasks while maintaining compression efficiency.


Vol. 14, No. 2, pp. 198-208

Citations: 0
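In INR-based compression, a signal is represented by the parameters of a model that maps coordinates to sample values, and those parameters (after quantization and entropy coding) form the bitstream. The toy below shrinks this idea to a 1D signal and a single linear layer over fixed sine/cosine features of the coordinate, fitted by plain gradient descent. A real INR codec would use an MLP, and the paper's Swin-Transformer-based semantic embedding is not modeled. All names and hyperparameters are hypothetical.

```python
import math
from typing import List, Sequence


def features(x: float, k_max: int) -> List[float]:
    """Bias term plus sin/cos features of the coordinate x in [0, 1]."""
    feats = [1.0]
    for k in range(1, k_max + 1):
        feats.append(math.sin(math.pi * k * x))
        feats.append(math.cos(math.pi * k * x))
    return feats


def decode(weights: Sequence[float], n: int, k_max: int) -> List[float]:
    """Reconstruct n samples by evaluating the representation at each coordinate."""
    return [sum(w * f for w, f in zip(weights, features(i / (n - 1), k_max)))
            for i in range(n)]


def fit_inr(signal: Sequence[float], k_max: int = 4, lr: float = 0.05,
            steps: int = 500) -> List[float]:
    """Fit the weights by gradient descent on the mean squared reconstruction error."""
    n = len(signal)
    phis = [features(i / (n - 1), k_max) for i in range(n)]
    weights = [0.0] * len(phis[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for phi, y in zip(phis, signal):
            err = sum(w * f for w, f in zip(weights, phi)) - y
            for j, f in enumerate(phi):
                grad[j] += 2.0 * err * f / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


signal = [math.sin(2 * math.pi * i / 15) for i in range(16)]
weights = fit_inr(signal)  # 9 numbers stand in for the 16 samples
```

In this toy, the 9 weights play the role of the compressed representation. In an actual codec they would be quantized before transmission, and the decoder would call decode to rebuild the samples.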

FPGA Codec System of Learned Image Compression With Algorithm-Architecture Co-Optimization

IF 3.7, CAS Zone 2 (Engineering & Technology), Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date: 2024-04-08 DOI: 10.1109/JETCAS.2024.3386328

Heming Sun; Qingyang Yi; Masahiro Fujita

Learned Image Compression (LIC) has shown coding performance competitive with traditional standards. Because LIC is computationally complex, hardware accelerators are needed. FPGAs are one such class of accelerator, used for their good reconfigurability and high power efficiency. However, prior work first developed the LIC neural network algorithm and then designed the associated FPGA hardware. Developing the algorithm and the architecture separately can easily cause layout problems such as routing congestion when hardware utilization is high. To mitigate this problem, this paper presents an algorithm-architecture co-optimization of LIC. We first constrain the input and output channel parallelism to ease routing congestion at higher DSP usage. We then adjust the number of channels to increase DSP efficiency. Compared with a recent design using a fine-grained pipelined architecture, our design achieves up to 1.5x higher throughput with almost the same coding performance on the Kodak dataset. Compared with another recent design accelerated by the AMD/Xilinx DPU, it achieves higher throughput and better coding performance.


Vol. 14, No. 2, pp. 334-347

Citations: 0
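The abstract says the channel counts were adjusted to raise DSP efficiency after the channel parallelism was constrained. Counting padding gives a simple view of why this helps. Suppose a layer with c_in input and c_out output channels is unrolled with parallelism p_in x p_out. Each pass then occupies a full p_in x p_out block of multipliers, so channel counts that are not multiples of the parallelism leave some DSPs idle. The model below is a hypothetical simplification that ignores kernel size, spatial tiling and routing. It is not the authors' cost model.

```python
import math


def dsp_efficiency(c_in: int, c_out: int, p_in: int, p_out: int) -> float:
    """Fraction of allocated multiplier slots doing useful work in one layer."""
    useful = c_in * c_out
    allocated = math.ceil(c_in / p_in) * p_in * math.ceil(c_out / p_out) * p_out
    return useful / allocated


def round_channels(c: int, p: int) -> int:
    """Snap a channel count to the nearest positive multiple of the parallelism p."""
    return p * max(1, (c + p // 2) // p)
```

For example, with 16-way parallelism, 200 channels are padded to 208 on both sides, giving an efficiency of 40000 / 43264, or about 0.925. Snapping the layer to 208 channels restores full utilization. Any effect of that change on coding performance would have to be checked separately.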

Generative Refinement for Low Bitrate Image Coding Using Vector Quantized Residual

IF 3.7, CAS Zone 2 (Engineering & Technology), Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date: 2024-04-05 DOI: 10.1109/JETCAS.2024.3385653

Yuzhuo Kong; Ming Lu; Zhan Ma

Despite significant recent progress in deep-learning-based image compression, reconstructed visual quality still suffers at low bitrates due to missing high-frequency information. Existing methods use generative adversarial networks (GANs) as an additional loss to supervise rate-distortion (R-D) optimization. This produces more high-frequency components for visually pleasing reconstructions but also introduces unwanted fake textures. This work instead generates high-frequency residuals to refine images reconstructed by existing image compression solutions. The residual signal is computed between the decoded image and its uncompressed input. It is then quantized to the appropriate codeword vectors in a learnable codebook for decoder-side generative refinement. Extensive experiments demonstrate that our method can restore high-frequency information in images compressed by any codec. It also outperforms state-of-the-art generative image compression algorithms and perception-oriented post-processing approaches. Moreover, the vector-quantized-residual method is highly robust and generalizes to both rule-based and learning-based compression models. It can therefore serve as a plug-and-play module for perceptual optimization without retraining.


Vol. 14, No. 2, pp. 185-197

Citations: 0

PKU-AIGI-500K: A Neural Compression Benchmark and Model for AI-Generated Images

IF 3.7 | Zone 2, Engineering & Technology | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-04-05 DOI: 10.1109/JETCAS.2024.3385629

Xunxu Duan;Siwei Ma;Hongbin Liu;Chuanmin Jia

In recent years, artificial intelligence-generated content (AIGC) enabled by foundation models has received increasing attention and developed rapidly. Text prompts can now be converted into high-quality, photo-realistic images. This capability, however, imposes extremely high bandwidth requirements for compressing and transmitting the vast number of AI-generated images (AIGIs) produced by such AIGC services. Despite this challenge, research on compression methods for AIGIs is conspicuously lacking, even though it is clearly necessary. This research addresses this gap by introducing a pioneering AIGI dataset, PKU-AIGI-500K, which comprises over 105k diverse prompts and over 528k images generated by five major foundation models. Using this dataset, we explore and analyze the essential characteristics of AIGIs and show empirically that existing data-driven lossy compression methods achieve sub-optimal rate-distortion performance without fine-tuning, primarily because of a domain shift between AIGIs and natural images. We comprehensively benchmark the rate-distortion performance and runtime complexity of openly available conventional and learned image coding solutions, uncovering new insights for emerging studies in AIGI compression. Moreover, to fully exploit the redundant information in an AIGI and its corresponding text, we propose an AIGI compression model, the Cross-Attention Transformer Codec (CATC), trained on this dataset as a strong baseline. Experimental results demonstrate that the proposed model achieves up to 30.09% bitrate reduction compared with the state-of-the-art (SOTA) H.266/VVC codec and outperforms the SOTA learned codec, paving the way for future research in AIGI compression.
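The abstract names cross-attention as the mechanism CATC uses to relate an image to its text prompt. As a generic illustration only (not the CATC architecture, whose details the abstract does not give), the sketch below shows plain scaled dot-product cross-attention, softmax(QK^T / sqrt(d)) V, where image features act as queries and text-prompt tokens supply keys and values. It assumes identity projections (no learned Q/K/V matrices), a single head, and small pure-Python lists; `cross_attention` is a hypothetical name.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product cross-attention.

    queries: image-side feature vectors (assumed already projected)
    keys/values: text-side token vectors (assumed already projected)
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    out: Matrix = []
    for q in queries:
        # Similarity of this image feature to every text token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the text-token values.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

With a zero query, every text token scores equally, so the output is the plain average of the value vectors; a query aligned with one key pulls the output toward that token's value. In a learned codec, the projections and the way the attended text features condition the entropy model or synthesis transform would be design choices of the specific model.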


Vol. 14, No. 2, pp. 172-184

Citations: 0

Survey on Visual Signal Coding and Processing With Generative Models: Technologies, Standards, and Optimization

IF 3.7 | Zone 2, Engineering & Technology | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Pub Date : 2024-03-21 DOI: 10.1109/JETCAS.2024.3403524

Zhibo Chen;Heming Sun;Li Zhang;Fan Zhang

This paper surveys the latest developments in visual signal coding and processing with generative models. Specifically, we focus on the advancement of generative models and their influence on research in visual signal coding and processing. The survey begins with a brief introduction to well-established generative models, including Variational Autoencoder (VAE) models, Generative Adversarial Network (GAN) models, Autoregressive (AR) models, Normalizing Flows, and Diffusion models. The subsequent section explores advances in visual signal coding based on generative models, as well as ongoing international standardization activities. For visual signal processing, we focus on the application and development of various generative models in visual signal restoration. We also present the latest developments in generative visual signal synthesis and editing, visual signal quality assessment using generative models, and quality assessment of generative models. Because the practical implementation of these techniques is closely linked to fast optimization, the paper additionally reviews the latest advances in fast optimization for visual signal coding and processing with generative models. We hope to advance this field by providing researchers and practitioners with a comprehensive literature review on visual signal coding and processing with generative models.


Vol. 14, No. 2, pp. 149-171

Citations: 0