With the evolution of storage and communication protocols, ultra-low bitrate image compression has become a topic in high demand. However, all existing compression algorithms must sacrifice either consistency with the ground truth or perceptual quality at ultra-low bitrates. In recent years, the rapid development of the Large Multimodal Model (LMM) has made it possible to balance these two goals. To solve this problem, this paper proposes a method called Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder that extracts the semantic information of the image, a map encoder that locates the regions corresponding to the semantics, an image encoder that generates an extremely compressed bitstream, and a decoder that reconstructs the image based on the above information. Experimental results show that the proposed MISC is suitable for compressing both traditional Natural Scene Images (NSIs) and emerging AI-Generated Images (AIGIs). It achieves optimal consistency and perception results while saving 50% of the bitrate, giving it strong potential for the next generation of storage and communication. The code will be released at https://github.com/lcysyzxdxc/MISC.
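To make the four-stage pipeline described above concrete, here is a minimal PyTorch sketch of a MISC-style codec: a semantic embedding (standing in for the LMM encoder output), a coarse map encoder, a heavily downsampling image encoder, and a decoder conditioned on all three. Every module name, layer size, and the toy networks themselves are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToySemanticCodec(nn.Module):
    """Hypothetical MISC-style pipeline: semantic embedding + region map + tiny latent -> image."""
    def __init__(self, sem_dim=64, latent_ch=4):
        super().__init__()
        self.image_encoder = nn.Sequential(              # extreme downsampling -> tiny "bitstream"
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, latent_ch, 4, stride=4),
        )
        self.map_encoder = nn.Sequential(                # coarse map locating the described regions
            nn.Conv2d(3, 8, 8, stride=8), nn.ReLU(),
            nn.Conv2d(8, 1, 1), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(                    # reconstruct from latent + map + semantics
            nn.ConvTranspose2d(latent_ch + 1 + sem_dim, 32, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=4), nn.Sigmoid(),
        )

    def forward(self, img, sem_embedding):
        z = self.image_encoder(img)                                  # (B, latent_ch, H/16, W/16)
        m = nn.functional.interpolate(self.map_encoder(img), size=z.shape[-2:])
        s = sem_embedding[:, :, None, None].expand(-1, -1, *z.shape[-2:])
        return self.decoder(torch.cat([z, m, s], dim=1))

img = torch.rand(1, 3, 64, 64)
sem = torch.rand(1, 64)                       # stand-in for a text/LMM semantic embedding
print(ToySemanticCodec()(img, sem).shape)     # torch.Size([1, 3, 64, 64])
```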
{"title":"MISC: Ultra-Low Bitrate Image Semantic Compression Driven by Large Multimodal Model","authors":"Chunyi Li;Guo Lu;Donghui Feng;Haoning Wu;Zicheng Zhang;Xiaohong Liu;Guangtao Zhai;Weisi Lin;Wenjun Zhang","doi":"10.1109/TIP.2024.3515874","DOIUrl":"10.1109/TIP.2024.3515874","url":null,"abstract":"With the evolution of storage and communication protocols, ultra-low bitrate image compression has become a highly demanding topic. However, all existing compression algorithms must sacrifice either consistency with the ground truth or perceptual quality at ultra-low bitrate. During recent years, the rapid development of the Large Multimodal Model (LMM) has made it possible to balance these two goals. To solve this problem, this paper proposes a method called Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder for extracting the semantic information of the image, a map encoder to locate the region corresponding to the semantic, an image encoder generates an extremely compressed bitstream, and a decoder reconstructs the image based on the above information. Experimental results show that our proposed MISC is suitable for compressing both traditional Natural Sense Images (NSIs) and emerging AI-Generated Images (AIGIs) content. It can achieve optimal consistency and perception results while saving 50% bitrate, which has strong potential applications in the next generation of storage and communication. The code will be released on <uri>https://github.com/lcysyzxdxc/MISC</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"335-349"},"PeriodicalIF":0.0,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-supervised point cloud representation learning aims to acquire robust and general feature representations from unlabeled data. Recently, masked point modeling-based methods have shown significant performance improvements for point cloud understanding. However, these methods rely on overlapping grouping strategies (the k-nearest neighbor algorithm), which cause early leakage of the structural information of masked groups, and they overlook the semantic modeling of object components, so parts with the same semantics exhibit obvious feature differences purely due to position differences. In this work, we rethink grouping strategies and pretext tasks that are more suitable for self-supervised point cloud representation learning and propose a novel hierarchical masked representation learning method, including an optimal transport-based hierarchical grouping strategy, a prototype-based part modeling module, and a hierarchical attention encoder. The proposed method enjoys several merits. First, the proposed grouping strategy partitions the point cloud into non-overlapping groups, eliminating the early leakage of structural information in the masked groups. Second, the proposed prototype-based part modeling module dynamically models different object components, ensuring feature consistency on parts with the same semantics. Extensive experiments on four downstream tasks demonstrate that our method surpasses state-of-the-art 3D representation learning methods. Furthermore, comprehensive ablation studies and visualizations demonstrate the effectiveness of the proposed modules.
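As a rough illustration of how an optimal-transport-based grouping yields non-overlapping, roughly equal-sized groups (unlike overlapping kNN neighborhoods), the toy Sinkhorn-style assignment below maps every point to exactly one center. The function name, temperature, and iteration count are assumptions for illustration, not the paper's grouping algorithm.

```python
import torch

def balanced_group_assign(points, centers, n_iters=50, eps=0.05):
    # Entropic OT: rows (points) and columns (groups) are alternately normalized,
    # encouraging each center to receive roughly the same number of points.
    cost = torch.cdist(points, centers)                  # (N, K) pairwise distances
    log_p = -cost / eps
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)   # one group per point
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)   # balanced group sizes
    return log_p.argmax(dim=1)                           # hard, non-overlapping group index per point

pts = torch.rand(1024, 3)
ctr = pts[torch.randperm(1024)[:64]]                     # 64 group centers sampled from the cloud
groups = balanced_group_assign(pts, ctr)
print(groups.shape, groups.unique().numel())             # 1024 assignments spread over ~64 groups
```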
{"title":"Rethinking Masked Representation Learning for 3D Point Cloud Understanding","authors":"Chuxin Wang;Yixin Zha;Jianfeng He;Wenfei Yang;Tianzhu Zhang","doi":"10.1109/TIP.2024.3520008","DOIUrl":"10.1109/TIP.2024.3520008","url":null,"abstract":"Self-supervised point cloud representation learning aims to acquire robust and general feature representations from unlabeled data. Recently, masked point modeling-based methods have shown significant performance improvements for point cloud understanding, yet these methods rely on overlapping grouping strategies (k-nearest neighbor algorithm) resulting in early leakage of structural information of mask groups, and overlook the semantic modeling of object components resulting in parts with the same semantics having obvious feature differences due to position differences. In this work, we rethink grouping strategies and pretext tasks that are more suitable for self-supervised point cloud representation learning and propose a novel hierarchical masked representation learning method, including an optimal transport-based hierarchical grouping strategy, a prototype-based part modeling module, and a hierarchical attention encoder. The proposed method enjoys several merits. First, the proposed grouping strategy partitions the point cloud into non-overlapping groups, eliminating the early leakage of structural information in the masked groups. Second, the proposed prototype-based part modeling module dynamically models different object components, ensuring feature consistency on parts with the same semantics. Extensive experiments on four downstream tasks demonstrate that our method surpasses state-of-the-art 3D representation learning methods. Furthermore, Comprehensive ablation studies and visualizations demonstrate the effectiveness of the proposed modules.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"247-262"},"PeriodicalIF":0.0,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TIP.2024.3512354
Shuai Zhao;Ruijie Quan;Linchao Zhu;Yi Yang
Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition (STR) methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. Our method establishes a simple yet strong baseline for future STR research with VLMs.
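The dual predict-and-refine scheme can be pictured as follows: the visual branch emits an initial character prediction, which is re-embedded as text and fused with the visual feature for a refined prediction. The toy module below (vocabulary size, sequence length, and the linear decoders) is a simplified stand-in under those assumptions, not CLIP4STR itself.

```python
import torch
import torch.nn as nn

class ToyDualBranchSTR(nn.Module):
    def __init__(self, vocab=37, max_len=25, dim=128):
        super().__init__()
        self.visual_decoder = nn.Linear(dim, vocab)       # visual branch: initial prediction
        self.char_embed = nn.Embedding(vocab, dim)        # re-embed the predicted characters
        self.cross_decoder = nn.Linear(2 * dim, vocab)    # cross-modal branch: refinement

    def forward(self, visual_feat):                       # visual_feat: (B, max_len, dim)
        logits_v = self.visual_decoder(visual_feat)       # predict
        text_feat = self.char_embed(logits_v.argmax(-1))  # text semantics of the prediction
        logits_c = self.cross_decoder(torch.cat([visual_feat, text_feat], dim=-1))  # refine
        return logits_v, logits_c

init_logits, refined_logits = ToyDualBranchSTR()(torch.rand(2, 25, 128))
print(init_logits.shape, refined_logits.shape)            # (2, 25, 37) for both branches
```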
{"title":"CLIP4STR: A Simple Baseline for Scene Text Recognition With Pre-Trained Vision-Language Model","authors":"Shuai Zhao;Ruijie Quan;Linchao Zhu;Yi Yang","doi":"10.1109/TIP.2024.3512354","DOIUrl":"10.1109/TIP.2024.3512354","url":null,"abstract":"Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. Our method establishes a simple yet strong baseline for future STR research with VLMs.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6893-6904"},"PeriodicalIF":0.0,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TIP.2024.3520423
Haochen Yu;Weixi Gong;Jiansheng Chen;Huimin Ma
Controllable 3D-aware scene synthesis seeks to disentangle the various latent codes in the implicit space, enabling the generation network to create highly realistic images with 3D consistency. Recent approaches often integrate Neural Radiance Fields with the upsampling method of StyleGAN2, employing convolutions with style modulation to transform spatial coordinates into frequency domain representations. Our analysis indicates that this approach can give rise to a bubble phenomenon in StyleNeRF. We argue that the style modulation introduces extraneous information into the implicit space, disrupting 3D implicit modeling and degrading image quality. We introduce HomuGAN, incorporating two key improvements. First, we disentangle the style modulation applied to implicit modeling from that utilized for super-resolution, thus alleviating the bubble phenomenon. Second, we introduce Cylindrical Spatial-Constrained Sampling and Parabolic Sampling. The latter, as an alternative to the former, specifically improves the foreground modeling of vehicles. We evaluate HomuGAN on publicly available datasets, comparing its performance to that of existing methods. Empirical results demonstrate that our model achieves the best performance, exhibiting comparatively strong disentanglement capability. Moreover, HomuGAN addresses the training instability problem observed in StyleNeRF and reduces the bubble phenomenon.
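For intuition on the Cylindrical Spatial-Constrained Sampling idea, the sketch below samples depths along camera rays and masks samples that fall outside a vertical cylinder around the scene center. The function name, near/far range, and cylinder dimensions are hypothetical choices; the actual HomuGAN sampler may differ.

```python
import torch

def cylinder_constrained_samples(origins, dirs, n_samples=32, near=0.5, far=4.0,
                                 radius=1.5, height=1.0):
    t = torch.linspace(near, far, n_samples)                          # (S,) depths
    pts = origins[:, None, :] + t[None, :, None] * dirs[:, None, :]   # (R, S, 3) sample points
    inside = (pts[..., 0] ** 2 + pts[..., 2] ** 2 <= radius ** 2) \
             & (pts[..., 1].abs() <= height)                          # keep in-cylinder samples only
    return pts, inside

rays_o = torch.zeros(8, 3) + torch.tensor([0.0, 0.0, -3.0])           # camera placed at z = -3
rays_d = torch.nn.functional.normalize(torch.rand(8, 3) - 0.5, dim=-1)
points, mask = cylinder_constrained_samples(rays_o, rays_d)
print(points.shape, mask.float().mean().item())                       # (8, 32, 3), in-cylinder fraction
```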
{"title":"HomuGAN: A 3D-Aware GAN With the Method of Cylindrical Spatial-Constrained Sampling","authors":"Haochen Yu;Weixi Gong;Jiansheng Chen;Huimin Ma","doi":"10.1109/TIP.2024.3520423","DOIUrl":"10.1109/TIP.2024.3520423","url":null,"abstract":"Controllable 3D-aware scene synthesis seeks to disentangle the various latent codes in the implicit space enabling the generation network to create highly realistic images with 3D consistency. Recent approaches often integrate Neural Radiance Fields with the upsampling method of StyleGAN2, employing Convolutions with style modulation to transform spatial coordinates into frequency domain representations. Our analysis indicates that this approach can give rise to a bubble phenomenon in StyleNeRF. We argue that the style modulation introduces extraneous information into the implicit space, disrupting 3D implicit modeling and degrading image quality. We introduce HomuGAN, incorporating two key improvements. First, we disentangle the style modulation applied to implicit modeling from that utilized for super-resolution, thus alleviating the bubble phenomenon. Second, we introduce Cylindrical Spatial-Constrained Sampling and Parabolic Sampling. The latter sampling method, as an alternative method to the former, specifically contributes to the performance of foreground modeling of vehicles. We evaluate HomuGAN on publicly available datasets, comparing its performance to existing methods. Empirical results demonstrate that our model achieves the best performance, exhibiting relatively outstanding disentanglement capability. Moreover, HomuGAN addresses the training instability problem observed in StyleNeRF and reduces the bubble phenomenon.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"320-334"},"PeriodicalIF":0.0,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TIP.2024.3515872
Wenhao Li;Mengyuan Liu;Hong Liu;Bin Ren;Xia Li;Yingxuan You;Nicu Sebe
Regression-based 3D human pose and shape estimation methods often fall into one of two paradigms. Parametric approaches, which regress the parameters of a human body model, tend to produce physically plausible but image-mesh misaligned results. In contrast, non-parametric approaches directly regress human mesh vertices, resulting in pixel-aligned but unreasonable predictions. In this paper, we consider these two paradigms together for a better overall estimation. To this end, we propose a novel HYbrid REgressor (HYRE) that greatly benefits from the joint learning of both paradigms. The core of our HYRE is a hybrid intermediary across paradigms that provides complementary clues to each paradigm at the shared feature level and fuses their results at the part-based decision level, thereby bridging the gap between the two. We demonstrate the effectiveness of the proposed method through both quantitative and qualitative experimental analyses, resulting in improvements for each approach and ultimately leading to better hybrid results. Our experiments show that HYRE outperforms previous methods on challenging 3D human pose and shape benchmarks.
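One way to picture decision-level, part-based fusion is a learned per-part gate that blends the parametric and non-parametric vertex predictions. The sketch below uses made-up vertex/part counts and a toy gating network; it illustrates the fusion idea only, not HYRE's actual regressor.

```python
import torch
import torch.nn as nn

class ToyPartFusion(nn.Module):
    def __init__(self, n_verts=432, n_parts=6, feat_dim=256):
        super().__init__()
        self.part_of_vertex = torch.randint(0, n_parts, (n_verts,))   # fixed vertex-to-part lookup
        self.gate = nn.Sequential(nn.Linear(feat_dim, n_parts), nn.Sigmoid())

    def forward(self, feat, verts_parametric, verts_nonparametric):
        w_part = self.gate(feat)                                      # (B, n_parts) per-part confidence
        w = w_part[:, self.part_of_vertex].unsqueeze(-1)              # broadcast to (B, n_verts, 1)
        return w * verts_parametric + (1 - w) * verts_nonparametric   # per-part blend of both paradigms

fusion = ToyPartFusion()
feat = torch.rand(2, 256)                                             # shared image feature
v_param, v_nonparam = torch.rand(2, 432, 3), torch.rand(2, 432, 3)    # two vertex estimates
print(fusion(feat, v_param, v_nonparam).shape)                        # torch.Size([2, 432, 3])
```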
{"title":"HYRE: Hybrid Regressor for 3D Human Pose and Shape Estimation","authors":"Wenhao Li;Mengyuan Liu;Hong Liu;Bin Ren;Xia Li;Yingxuan You;Nicu Sebe","doi":"10.1109/TIP.2024.3515872","DOIUrl":"10.1109/TIP.2024.3515872","url":null,"abstract":"Regression-based 3D human pose and shape estimation often fall into one of two different paradigms. Parametric approaches, which regress the parameters of a human body model, tend to produce physically plausible but image-mesh misalignment results. In contrast, non-parametric approaches directly regress human mesh vertices, resulting in pixel-aligned but unreasonable predictions. In this paper, we consider these two paradigms together for a better overall estimation. To this end, we propose a novel HYbrid REgressor (HYRE) that greatly benefits from the joint learning of both paradigms. The core of our HYRE is a hybrid intermediary across paradigms that provides complementary clues to each paradigm at the shared feature level and fuses their results at the part-based decision level, thereby bridging the gap between the two. We demonstrate the effectiveness of the proposed method through both quantitative and qualitative experimental analyses, resulting in improvements for each approach and ultimately leading to better hybrid results. Our experiments show that HYRE outperforms previous methods on challenging 3D human pose and shape benchmarks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"235-246"},"PeriodicalIF":0.0,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-24 | DOI: 10.1109/TIP.2024.3519997
Chao Xie;Linfeng Fei;Huanjie Tao;Yaocong Hu;Wei Zhou;Jiun Tian Hoe;Weipeng Hu;Yap-Peng Tan
Recently, neural networks have become the dominant approach to low-light image enhancement (LLIE), with at least one-third of them adopting a Retinex-related architecture. However, through in-depth analysis, we contend that this most widely accepted LLIE structure is suboptimal, particularly when addressing the non-uniform illumination commonly observed in natural images. In this paper, we present a novel variant learning framework, termed residual quotient learning, to substantially alleviate this issue. Instead of following the existing Retinex-related decomposition-enhancement-reconstruction process, our basic idea is to explicitly reformulate the light enhancement task as adaptively predicting the latent quotient with reference to the original low-light input in a residual learning fashion. By leveraging the proposed residual quotient learning, we develop a lightweight yet effective network called ResQ-Net. This network features enhanced non-uniform illumination modeling capabilities, making it more suitable for real-world LLIE tasks. Moreover, due to its well-designed structure and reference-free loss function, ResQ-Net is flexible in training, as it allows for zero-reference optimization, which further enhances the generalization and adaptability of our entire framework. Extensive experiments on various benchmark datasets demonstrate the merits and effectiveness of the proposed residual quotient learning, and our trained ResQ-Net outperforms state-of-the-art methods both qualitatively and quantitatively. Furthermore, a practical application in dark face detection is explored, and the preliminary results confirm the potential and feasibility of our method in real-world scenarios.
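One plausible reading of residual quotient learning is: predict a per-pixel quotient close to 1 in a residual fashion and divide the low-light input by it, so dark regions are brightened without a Retinex-style decomposition. The sketch below follows that reading with a placeholder CNN and an arbitrary quotient range; it is not the ResQ-Net architecture or its zero-reference loss.

```python
import torch
import torch.nn as nn

class ToyResQ(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                       # placeholder backbone
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, x):
        residual = torch.tanh(self.net(x))              # residual prediction in (-1, 1)
        quotient = 1.0 - 0.9 * residual.clamp(min=0)    # quotient in (0.1, 1]: dividing brightens
        return (x / quotient).clamp(0, 1), quotient

low = torch.rand(1, 3, 64, 64) * 0.2                    # synthetic dark image
enhanced, q = ToyResQ()(low)
print(enhanced.mean().item() >= low.mean().item())      # True: mean brightness never decreases
```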
{"title":"Residual Quotient Learning for Zero-Reference Low-Light Image Enhancement","authors":"Chao Xie;Linfeng Fei;Huanjie Tao;Yaocong Hu;Wei Zhou;Jiun Tian Hoe;Weipeng Hu;Yap-Peng Tan","doi":"10.1109/TIP.2024.3519997","DOIUrl":"10.1109/TIP.2024.3519997","url":null,"abstract":"Recently, neural networks have become the dominant approach to low-light image enhancement (LLIE), with at least one-third of them adopting a Retinex-related architecture. However, through in-depth analysis, we contend that this most widely accepted LLIE structure is suboptimal, particularly when addressing the non-uniform illumination commonly observed in natural images. In this paper, we present a novel variant learning framework, termed residual quotient learning, to substantially alleviate this issue. Instead of following the existing Retinex-related decomposition-enhancement-reconstruction process, our basic idea is to explicitly reformulate the light enhancement task as adaptively predicting the latent quotient with reference to the original low-light input using a residual learning fashion. By leveraging the proposed residual quotient learning, we develop a lightweight yet effective network called ResQ-Net. This network features enhanced non-uniform illumination modeling capabilities, making it more suitable for real-world LLIE tasks. Moreover, due to its well-designed structure and reference-free loss function, ResQ-Net is flexible in training as it allows for zero-reference optimization, which further enhances the generalization and adaptability of our entire framework. Extensive experiments on various benchmark datasets demonstrate the merits and effectiveness of the proposed residual quotient learning, and our trained ResQ-Net outperforms state-of-the-art methods both qualitatively and quantitatively. Furthermore, a practical application in dark face detection is explored, and the preliminary results confirm the potential and feasibility of our method in real-world scenarios.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"365-378"},"PeriodicalIF":0.0,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142884230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TIP.2024.3518759
Yang Yang;Wenjuan Xi;Luping Zhou;Jinhui Tang
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Actually, the assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently influenced when facing imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constrains instance-level matching, the structure-aware distillation further regularizes the geometric consistency between learned matching representations and intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.
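The pairing of an instance-level matching loss with structure-aware (relational) distillation can be sketched as two loss terms: an InfoNCE-style matching loss plus a KL term that pushes the pairwise-similarity structure of the shared embeddings towards that of the intra-modal embeddings. The temperatures, weighting, and embedding sizes below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def matching_loss(img_emb, txt_emb, tau=0.07):
    # standard symmetric InfoNCE over matched image-text pairs
    logits = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).t() / tau
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def structure_distill_loss(shared_emb, intra_emb, tau=0.1):
    # match the relation (similarity) structure of the shared space to the intra-modal space
    p = F.log_softmax(shared_emb @ shared_emb.t() / tau, dim=1)   # student relations
    q = F.softmax(intra_emb @ intra_emb.t() / tau, dim=1)         # teacher relations
    return F.kl_div(p, q, reduction="batchmean")

img, txt, img_backbone = torch.rand(8, 128), torch.rand(8, 128), torch.rand(8, 128)
total = matching_loss(img, txt) + 0.5 * structure_distill_loss(img, img_backbone)
print(total.item())
```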
{"title":"Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation","authors":"Yang Yang;Wenjuan Xi;Luping Zhou;Jinhui Tang","doi":"10.1109/TIP.2024.3518759","DOIUrl":"10.1109/TIP.2024.3518759","url":null,"abstract":"Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Actually, the assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently influenced when facing imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constraints instance-level matching, the structure-aware distillation further regularizes the geometric consistency between learned matching representations and intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6881-6892"},"PeriodicalIF":0.0,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142879658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-23 | DOI: 10.1109/TIP.2024.3518097
Jiarui Zhang;Ruixu Geng;Xiaolong Du;Yan Chen;Houqiang Li;Yang Hu
Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task, since changing the condition leads to a different imaging model. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at https://github.com/JerryOctopus/NLOS-LTM.
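The core mechanism, a vector-quantized light-transport code that modulates the reconstruction network's features, can be sketched with a small codebook and FiLM-style scale/shift modulation as below. Codebook size, dimensions, and module names are illustrative guesses, not the NLOS-LTM implementation (the straight-through trick used to train a real vector quantizer is omitted).

```python
import torch
import torch.nn as nn

class ToyLTModulation(nn.Module):
    def __init__(self, n_codes=16, code_dim=32, feat_ch=64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, code_dim)              # light-transport codebook
        self.to_scale_shift = nn.Linear(code_dim, 2 * feat_ch)

    def forward(self, lt_embedding, feat):
        d = torch.cdist(lt_embedding, self.codebook.weight)          # (B, n_codes) distances
        code = self.codebook(d.argmin(dim=1))                        # nearest-code lookup
        scale, shift = self.to_scale_shift(code).chunk(2, dim=1)     # (B, feat_ch) each
        return feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

mod = ToyLTModulation()
feat = torch.rand(2, 64, 32, 32)                                     # reconstruction-network features
lt = torch.rand(2, 32)                                               # inferred light-transport embedding
print(mod(lt, feat).shape)                                           # torch.Size([2, 64, 32, 32])
```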
{"title":"Passive Non-Line-of-Sight Imaging With Light Transport Modulation","authors":"Jiarui Zhang;Ruixu Geng;Xiaolong Du;Yan Chen;Houqiang Li;Yang Hu","doi":"10.1109/TIP.2024.3518097","DOIUrl":"10.1109/TIP.2024.3518097","url":null,"abstract":"Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years, due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task since changing the conditions will lead to different imaging models. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at <uri>https://github.com/JerryOctopus/NLOS-LTM</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"410-424"},"PeriodicalIF":0.0,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142879933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-20 | DOI: 10.1109/TIP.2024.3518100
Zhao Wang;Bolin Chen;Shurun Wang;Shiqi Wang;Yan Ye;Siwei Ma
How to compress face video is a crucial problem for a series of online applications, such as video chat/conference, live broadcasting and remote education. Compared to other natural videos, these face-centric videos, which contain abundant structural information, can be compactly represented and reconstructed at high quality via deep generative models, such that promising compression performance can be achieved. However, existing generative face video compression schemes face an inconsistency between the 3D facial motion in the physical world and the face content evolution in the 2D view. To address this drawback, we propose a 3D-Keypoint-and-2D-Motion based generative method for Face Video Compression, namely FVC-3K2M, which can well ensure perceptual compensation and visual consistency between motion description and face reconstruction. In particular, the temporal evolution of face video can be characterized by separate 3D keypoints from the global and local perspectives, entailing great coding flexibility and accurate motion representation. Moreover, a cascade motion conversion mechanism is further proposed to internally convert 3D keypoints to 2D dense motion, enforcing the face video reconstruction to be perceptually realistic. Finally, an adaptive reference frame selection scheme is developed to enhance adaptation to various temporal movements. Experimental results show that the proposed scheme can realize reliable video communication under extremely limited bandwidth, e.g., 2 kbps. Compared to state-of-the-art video coding standards and the latest face video compression methods, extensive comparisons demonstrate that our proposed scheme achieves superior compression performance in terms of multiple quality evaluations.
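A toy version of a 3D-keypoint-to-2D-dense-motion conversion is sketched below: keypoints are projected to 2D, and their displacements are spread over the image plane with Gaussian weights to form a sampling grid that warps a decoded reference frame. The orthographic projection, Gaussian weighting, and all sizes are simplifying assumptions, not the paper's cascade motion conversion mechanism.

```python
import torch
import torch.nn.functional as F

def keypoints_to_dense_grid(kp_src, kp_dst, h=64, w=64, sigma=0.1):
    kp2_src, kp2_dst = kp_src[..., :2], kp_dst[..., :2]               # drop depth (orthographic)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).view(1, h * w, 1, 2)         # pixel coords in [-1, 1]
    dist2 = ((grid - kp2_dst[:, None]) ** 2).sum(-1)                  # (B, HW, K)
    wgt = torch.softmax(-dist2 / (2 * sigma ** 2), dim=-1)            # per-pixel keypoint weights
    disp = kp2_src[:, None] - kp2_dst[:, None]                        # (B, 1, K, 2) backward motion
    flow = (wgt[..., None] * disp).sum(2).view(-1, h, w, 2)           # dense 2D motion map
    return grid.view(1, h, w, 2) + flow                               # sampling grid for warping

kp_prev, kp_cur = torch.rand(1, 10, 3) * 2 - 1, torch.rand(1, 10, 3) * 2 - 1
ref = torch.rand(1, 3, 64, 64)                                        # decoded reference frame
warped = F.grid_sample(ref, keypoints_to_dense_grid(kp_prev, kp_cur), align_corners=True)
print(warped.shape)                                                   # torch.Size([1, 3, 64, 64])
```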
{"title":"Ultra-Low Bitrate Face Video Compression Based on Conversions From 3D Keypoints to 2D Motion Map","authors":"Zhao Wang;Bolin Chen;Shurun Wang;Shiqi Wang;Yan Ye;Siwei Ma","doi":"10.1109/TIP.2024.3518100","DOIUrl":"10.1109/TIP.2024.3518100","url":null,"abstract":"How to compress face video is a crucial problem for a series of online applications, such as video chat/conference, live broadcasting and remote education. Compared to other natural videos, these face-centric videos owning abundant structural information can be compactly represented and high-quality reconstructed via deep generative models, such that the promising compression performance can be achieved. However, the existing generative face video compression schemes are faced with the inconsistency between the 3D facial motion in the physical world and the face content evolution in the 2D view. To solve this drawback, we propose a 3D-Keypoint-and-2D-Motion based generative method for Face Video Compression, namely FVC-3K2M, which can well ensure perceptual compensation and visual consistency between motion description and face reconstruction. In particular, the temporal evolution of face video can be characterized into separate 3D keypoints from the global and local perspectives, entailing great coding flexibility and accurate motion representation. Moreover, a cascade motion conversion mechanism is further proposed to internally convert 3D keypoints to 2D dense motion, enforcing the face video reconstruction to be perceptually realistic. Finally, an adaptive reference frame selection scheme is developed to enhance the adaptation of various temporal movements. Experimental results show that the proposed scheme can realize reliable video communication in the extremely limited bandwidth, e.g., 2 kbps. Compared to the state-of-the-art video coding standards and the latest face video compression methods, extensive comparisons demonstrate that our proposed scheme achieves superior compression performance in terms of multiple quality evaluations.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6850-6864"},"PeriodicalIF":0.0,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142867126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-20 | DOI: 10.1109/TIP.2024.3515873
Yuzhi Zhao;Lai-Man Po;Xin Ye;Yongzhe Xu;Qiong Yan
Image degradation caused by noise and blur remains a persistent challenge in imaging systems, stemming from limitations in both hardware and methodology. Single-image solutions face an inherent tradeoff between noise reduction and motion blur. While short exposures can capture clear motion, they suffer from noise amplification. Long exposures reduce noise but introduce blur. Learning-based single-image enhancers tend to produce over-smoothed results due to the limited information. Multi-image solutions using burst mode avoid this tradeoff by capturing more spatial-temporal information but often struggle with misalignment caused by camera/scene motion. To address these limitations, we propose a physical-model-based image restoration approach leveraging a novel dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long exposures that start at the same time but have different durations, this method integrates complementary noise-blur information within a single image. We further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data from Bayer patterns to facilitate training. Based on this dual-exposure sensor model, we design a hierarchical convolutional neural network called QRNet to recover high-quality RGB images. The network incorporates input enhancement blocks and multi-level feature extraction to improve restoration quality. Experiments demonstrate superior performance over state-of-the-art deblurring and denoising methods on both synthetic and real-world datasets. The code, model, and datasets are publicly available at https://github.com/zhaoyuzhi/QRNet.
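To illustrate what a dual-exposure mosaic looks like at the data level, the sketch below interleaves rows of a short and a long exposure of the same scene into one single-channel raw frame, loosely in the spirit of the described B2QB synthesis. The exact row pattern and the 2-D raw format are assumptions, not the paper's sensor model.

```python
import torch

def dual_exposure_mosaic(short_raw, long_raw):
    # Within every 4-row period (one Quad-Bayer cell height), take two rows from
    # the short exposure (sharp, noisy) and two from the long exposure (clean, blurred).
    assert short_raw.shape == long_raw.shape
    mosaic = long_raw.clone()
    short_rows = (torch.arange(short_raw.shape[0]) % 4) < 2
    mosaic[short_rows] = short_raw[short_rows]
    return mosaic

short = torch.rand(8, 8) * 0.5                   # short exposure: underexposed but motion-sharp
long = torch.rand(8, 8)                          # long exposure: brighter but motion-blurred
print(dual_exposure_mosaic(short, long).shape)   # torch.Size([8, 8])
```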
{"title":"Modeling Dual-Exposure Quad-Bayer Patterns for Joint Denoising and Deblurring","authors":"Yuzhi Zhao;Lai-Man Po;Xin Ye;Yongzhe Xu;Qiong Yan","doi":"10.1109/TIP.2024.3515873","DOIUrl":"10.1109/TIP.2024.3515873","url":null,"abstract":"Image degradation caused by noise and blur remains a persistent challenge in imaging systems, stemming from limitations in both hardware and methodology. Single-image solutions face an inherent tradeoff between noise reduction and motion blur. While short exposures can capture clear motion, they suffer from noise amplification. Long exposures reduce noise but introduce blur. Learning-based single-image enhancers tend to be over-smooth due to the limited information. Multi-image solutions using burst mode avoid this tradeoff by capturing more spatial-temporal information but often struggle with misalignment from camera/scene motion. To address these limitations, we propose a physical-model-based image restoration approach leveraging a novel dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long exposures at the same starting point but with varying durations, this method integrates complementary noise-blur information within a single image. We further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data from Bayer patterns to facilitate training. Based on this dual-exposure sensor model, we design a hierarchical convolutional neural network called QRNet to recover high-quality RGB images. The network incorporates input enhancement blocks and multi-level feature extraction to improve restoration quality. Experiments demonstrate superior performance over state-of-the-art deblurring and denoising methods on both synthetic and real-world datasets. The code, model, and datasets are publicly available at <uri>https://github.com/zhaoyuzhi/QRNet</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"350-364"},"PeriodicalIF":0.0,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142867124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}