
Latest publications from IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society

Universal Fine-Grained Visual Categorization by Concept Guided Learning
Qi Bi;Beichen Zhou;Wei Ji;Gui-Song Xia
Existing fine-grained visual categorization (FGVC) methods assume that fine-grained semantics rest in the informative parts of an image. This assumption works well on favorable front-view, object-centric images, but faces great challenges in many real-world scenarios, such as scene-centric images (e.g., street view) and adverse viewpoints (e.g., object re-identification, remote sensing). In such scenarios, feature mis-/over-activation is likely to confuse the part selection and degrade the fine-grained representation. In this paper, we design a universal FGVC framework for real-world scenarios. More precisely, we propose concept guided learning (CGL), which models the concepts of a fine-grained category as a combination of concepts inherited from the coarse-grained category to which it belongs and discriminative concepts of its own. The discriminative concepts are utilized to guide fine-grained representation learning. Specifically, three key steps are designed, namely concept mining, concept fusion, and concept constraint. In addition, to bridge the FGVC dataset gap under scene-centric and adverse-viewpoint scenarios, a Fine-grained Land-cover Categorization Dataset (FGLCD) with 59,994 fine-grained samples is proposed. Extensive experiments show that the proposed CGL: 1) achieves competitive performance on conventional FGVC; 2) achieves state-of-the-art performance on fine-grained aerial scenes and scene-centric street scenes; and 3) generalizes well to object re-identification and fine-grained aerial object detection. The dataset and source code will be available at https://github.com/BiQiWHU/CGL.
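As a rough illustration of the concept-combination idea only (not the authors' implementation; the mixing weight, tensor shapes, and function name are assumptions), a fine-grained concept bank could be formed by blending concepts inherited from the parent coarse-grained class with the class's own discriminative concepts:

```python
import torch

def fuse_concepts(inherited, discriminative, alpha=0.5):
    """Combine inherited coarse-class concepts with a fine-grained class's own
    discriminative concepts into one concept bank.
    inherited, discriminative: (num_concepts, dim) tensors; alpha is an assumed weight."""
    return alpha * inherited + (1.0 - alpha) * discriminative

# The fused concepts could then guide representation learning, e.g. via attention
# over image features; the paper's concept mining/constraint steps are not shown here.
```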
{"title":"Universal Fine-Grained Visual Categorization by Concept Guided Learning","authors":"Qi Bi;Beichen Zhou;Wei Ji;Gui-Song Xia","doi":"10.1109/TIP.2024.3523802","DOIUrl":"10.1109/TIP.2024.3523802","url":null,"abstract":"Existing fine-grained visual categorization (FGVC) methods assume that the fine-grained semantics rest in the informative parts of an image. This assumption works well on favorable front-view object-centric images, but can face great challenges in many real-world scenarios, such as scene-centric images (e.g., street view) and adverse viewpoint (e.g., object re-identification, remote sensing). In such scenarios, the mis-/over- feature activation is likely to confuse the part selection and degrade the fine-grained representation. In this paper, we are motivated to design a universal FGVC framework for real-world scenarios. More precisely, we propose a concept guided learning (CGL), which models concepts of a certain fine-grained category as a combination of inherited concepts from its subordinate coarse-grained category and discriminative concepts from its own. The discriminative concepts is utilized to guide the fine-grained representation learning. Specifically, three key steps are designed, namely, concept mining, concept fusion, and concept constraint. On the other hand, to bridge the FGVC dataset gap under scene-centric and adverse viewpoint scenarios, a Fine-grained Land-cover Categorization Dataset (FGLCD) with 59,994 fine-grained samples is proposed. Extensive experiments show the proposed CGL: 1) has a competitive performance on conventional FGVC; 2) achieves state-of-the-art performance on fine-grained aerial scenes & scene-centric street scenes; 3) good generalization on object re-identification and fine-grained aerial object detection. The dataset and source code will be available at <uri>https://github.com/BiQiWHU/CGL</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"394-409"},"PeriodicalIF":0.0,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142934652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Constrained Visual Representation Learning With Bisimulation Metrics for Safe Reinforcement Learning
Rongrong Wang;Yuhu Cheng;Xuesong Wang
Safe reinforcement learning aims to ensure optimal performance while minimizing potential risks. In real-world applications, especially in scenarios that rely on visual inputs, a key challenge lies in extracting the features essential for safe decision-making while maintaining sample efficiency. To address this issue, we propose constrained visual representation learning with bisimulation metrics for safe reinforcement learning (CVRL-BM). CVRL-BM constructs a sequential conditional variational inference model to compress high-dimensional visual observations into low-dimensional state representations. Additionally, safety bisimulation metrics are introduced to quantify the behavioral similarity between states, and our objective is to make the distance between any two latent state representations as close as possible to the safety bisimulation metric between their corresponding states. By integrating these two components, CVRL-BM learns compact and information-rich visual state representations while satisfying predefined safety constraints. Experiments on Safety Gym show that CVRL-BM outperforms existing vision-based safe reinforcement learning methods in both safety and efficacy. In particular, CVRL-BM surpasses the state-of-the-art Safe SLAC method with a 19.748% higher reward return, a 41.772% lower cost return, and a 5.027% decrease in cost regret. These results highlight the effectiveness of the proposed CVRL-BM.
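A minimal sketch of the alignment objective described above, assuming the safety bisimulation metric between states is available as a precomputed pairwise matrix (the exact loss and distance used by CVRL-BM may differ):

```python
import torch
import torch.nn.functional as F

def bisimulation_alignment_loss(z, safety_bisim):
    """Encourage pairwise distances between latent state representations to match
    the safety bisimulation metric between the corresponding states.
    z: (B, D) latent representations; safety_bisim: (B, B) metric values (assumed given)."""
    latent_dist = torch.cdist(z, z, p=1)          # pairwise L1 distances in latent space
    return F.mse_loss(latent_dist, safety_bisim)  # pull the two distance structures together
```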
{"title":"Constrained Visual Representation Learning With Bisimulation Metrics for Safe Reinforcement Learning","authors":"Rongrong Wang;Yuhu Cheng;Xuesong Wang","doi":"10.1109/TIP.2024.3523798","DOIUrl":"10.1109/TIP.2024.3523798","url":null,"abstract":"Safe reinforcement learning aims to ensure the optimal performance while minimizing potential risks. In real-world applications, especially in scenarios that rely on visual inputs, a key challenge lies in the extraction of essential features for safe decision-making while maintaining the sample efficiency. To address this issue, we propose the constrained visual representation learning with bisimulation metrics for safe reinforcement learning (CVRL-BM). CVRL-BM constructs a sequential conditional variational inference model to compress high-dimensional visual observations into low-dimensional state representations. Additionally, safety bisimulation metrics are introduced to quantify the behavioral similarity between states, and our objective is to make the distance between any two latent state representations as close as possible to the safety bisimulation metric between their corresponding states. By integrating these two components, CVRL-BM is able to learn compact and information-rich visual state representations while satisfying predefined safety constraints. Experiments on Safety Gym show that CVRL-BM outperforms existing vision-based safe reinforcement learning methods in safety and efficacy. Particularly, CVRL-BM surpasses the state-of-the-art Safe SLAC method by achieving a 19.748% higher reward return, a 41.772% lower cost return, and a 5.027% decrease in cost regret. These results highlight the effectiveness of our proposed CVRL-BM.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"379-393"},"PeriodicalIF":0.0,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142934771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Reviewer Summary for Transactions on Image Processing
{"title":"Reviewer Summary for Transactions on Image Processing","authors":"","doi":"10.1109/TIP.2024.3513592","DOIUrl":"10.1109/TIP.2024.3513592","url":null,"abstract":"","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6905-6925"},"PeriodicalIF":0.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10819972","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142911979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Linearly Transformed Color Guide for Low-Bitrate Diffusion-Based Image Compression
Tom Bordin;Thomas Maugey
This study addresses the challenge of controlling the global color aspect of images generated by a diffusion model without training or fine-tuning. We rewrite the guidance equations to ensure that the outputs are closer to a known color map without compromising generation quality, which yields new guidance equations. In the context of color guidance, we show that the guidance scale should not decrease but rather increase throughout the diffusion process. As a second contribution, our guidance is applied in a compression framework, where we combine both semantic and general color information of the image to decode at very low cost. We show that our method is effective in improving the fidelity and realism of compressed images at extremely low bit rates ($10^{-2}$ bpp), performing better on these criteria than other classical or more semantically oriented approaches. The implementation of our method is available on GitLab at https://gitlab.inria.fr/tbordin/color-guidance.
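The finding that the guidance scale should grow over the reverse process can be sketched schematically as follows; this is an illustrative update on a predicted clean image, not the paper's actual guidance equations, and the linear schedule is an assumption:

```python
def color_guided_update(x0_pred, color_target, t, T, w_max=1.0):
    """Nudge the current estimate toward a known color map with a weight that
    increases as the reverse diffusion step t goes from T down to 0."""
    w = w_max * (1.0 - t / T)      # small early in the reverse process, large near the end
    return x0_pred + w * (color_target - x0_pred)
```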
{"title":"Linearly Transformed Color Guide for Low-Bitrate Diffusion-Based Image Compression","authors":"Tom Bordin;Thomas Maugey","doi":"10.1109/TIP.2024.3521301","DOIUrl":"10.1109/TIP.2024.3521301","url":null,"abstract":"This study addresses the challenge of controlling the global color aspect of images generated by a diffusion model without training or fine-tuning. We rewrite the guidance equations to ensure that the outputs are closer to a known color map, without compromising the quality of the generation. Our method results in new guidance equations. In the context of color guidance, we show that the scaling of the guidance should not decrease but rather increase throughout the diffusion process. In a second contribution, our guidance is applied in a compression framework, where we combine both semantic and general color information of the image to decode at very low cost. We show that our method is effective in improving the fidelity and realism of compressed images at extremely low bit rates (<inline-formula> <tex-math>$10^{-2}$ </tex-math></inline-formula>bpp), performing better on these criteria when compared to other classical or more semantically oriented approaches. The implementation of our method is available on gitlab at <uri>https://gitlab.inria.fr/tbordin/color-guidance</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"468-482"},"PeriodicalIF":0.0,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CLIP4STR: A Simple Baseline for Scene Text Recognition With Pre-Trained Vision-Language Model
Shuai Zhao;Ruijie Quan;Linchao Zhu;Yi Yang
Pre-trained vision-language models (VLMs) are the de facto foundation models for various downstream tasks. However, scene text recognition (STR) methods still prefer backbones pre-trained on a single modality, namely the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. Our method establishes a simple yet strong baseline for future STR research with VLMs.
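A schematic of the predict-and-refine idea described above; the module names, interfaces, and the way the provisional transcription is fed back are assumptions for illustration, not CLIP4STR's actual code:

```python
import torch.nn as nn

class PredictAndRefine(nn.Module):
    """Dual-branch decoding sketch: a visual branch makes an initial character
    prediction, then a cross-modal branch refines it using text semantics."""
    def __init__(self, image_encoder, text_encoder, visual_decoder, cross_modal_decoder):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.visual_decoder = visual_decoder
        self.cross_modal_decoder = cross_modal_decoder

    def forward(self, image):
        vis_feat = self.image_encoder(image)
        init_logits = self.visual_decoder(vis_feat)              # initial prediction (visual branch)
        txt_feat = self.text_encoder(init_logits.argmax(dim=-1)) # embed the provisional transcription
        refined = self.cross_modal_decoder(vis_feat, txt_feat)   # refine with text semantics
        return init_logits, refined
```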
{"title":"CLIP4STR: A Simple Baseline for Scene Text Recognition With Pre-Trained Vision-Language Model","authors":"Shuai Zhao;Ruijie Quan;Linchao Zhu;Yi Yang","doi":"10.1109/TIP.2024.3512354","DOIUrl":"10.1109/TIP.2024.3512354","url":null,"abstract":"Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. Our method establishes a simple yet strong baseline for future STR research with VLMs.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6893-6904"},"PeriodicalIF":0.0,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Residual Quotient Learning for Zero-Reference Low-Light Image Enhancement
Chao Xie;Linfeng Fei;Huanjie Tao;Yaocong Hu;Wei Zhou;Jiun Tian Hoe;Weipeng Hu;Yap-Peng Tan
Recently, neural networks have become the dominant approach to low-light image enhancement (LLIE), with at least one-third of them adopting a Retinex-related architecture. However, through in-depth analysis, we contend that this most widely accepted LLIE structure is suboptimal, particularly when addressing the non-uniform illumination commonly observed in natural images. In this paper, we present a novel variant learning framework, termed residual quotient learning, to substantially alleviate this issue. Instead of following the existing Retinex-related decomposition-enhancement-reconstruction process, our basic idea is to explicitly reformulate the light enhancement task as adaptively predicting the latent quotient with reference to the original low-light input using a residual learning fashion. By leveraging the proposed residual quotient learning, we develop a lightweight yet effective network called ResQ-Net. This network features enhanced non-uniform illumination modeling capabilities, making it more suitable for real-world LLIE tasks. Moreover, due to its well-designed structure and reference-free loss function, ResQ-Net is flexible in training as it allows for zero-reference optimization, which further enhances the generalization and adaptability of our entire framework. Extensive experiments on various benchmark datasets demonstrate the merits and effectiveness of the proposed residual quotient learning, and our trained ResQ-Net outperforms state-of-the-art methods both qualitatively and quantitatively. Furthermore, a practical application in dark face detection is explored, and the preliminary results confirm the potential and feasibility of our method in real-world scenarios.
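A heavily simplified sketch of the residual-quotient idea, under the assumption that the network predicts a residual around an identity quotient and that enhancement divides the input by that quotient; the paper's exact formulation, clamping, and network design may differ:

```python
import torch

def residual_quotient_enhance(low_light, net, eps=1e-3):
    """Enhance a low-light image by dividing it by a predicted latent quotient,
    modeled as 1 + residual so the mapping is identity when the residual is zero.
    low_light: image tensor in [0, 1]; net: any module predicting a same-shaped residual."""
    quotient = 1.0 + net(low_light)                        # residual prediction around identity
    return (low_light / quotient.clamp(min=eps)).clamp(0.0, 1.0)
```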
{"title":"Residual Quotient Learning for Zero-Reference Low-Light Image Enhancement","authors":"Chao Xie;Linfeng Fei;Huanjie Tao;Yaocong Hu;Wei Zhou;Jiun Tian Hoe;Weipeng Hu;Yap-Peng Tan","doi":"10.1109/TIP.2024.3519997","DOIUrl":"10.1109/TIP.2024.3519997","url":null,"abstract":"Recently, neural networks have become the dominant approach to low-light image enhancement (LLIE), with at least one-third of them adopting a Retinex-related architecture. However, through in-depth analysis, we contend that this most widely accepted LLIE structure is suboptimal, particularly when addressing the non-uniform illumination commonly observed in natural images. In this paper, we present a novel variant learning framework, termed residual quotient learning, to substantially alleviate this issue. Instead of following the existing Retinex-related decomposition-enhancement-reconstruction process, our basic idea is to explicitly reformulate the light enhancement task as adaptively predicting the latent quotient with reference to the original low-light input using a residual learning fashion. By leveraging the proposed residual quotient learning, we develop a lightweight yet effective network called ResQ-Net. This network features enhanced non-uniform illumination modeling capabilities, making it more suitable for real-world LLIE tasks. Moreover, due to its well-designed structure and reference-free loss function, ResQ-Net is flexible in training as it allows for zero-reference optimization, which further enhances the generalization and adaptability of our entire framework. Extensive experiments on various benchmark datasets demonstrate the merits and effectiveness of the proposed residual quotient learning, and our trained ResQ-Net outperforms state-of-the-art methods both qualitatively and quantitatively. Furthermore, a practical application in dark face detection is explored, and the preliminary results confirm the potential and feasibility of our method in real-world scenarios.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"365-378"},"PeriodicalIF":0.0,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142884230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation
Yang Yang;Wenjuan Xi;Luping Zhou;Jinhui Tang
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. The assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently affected by imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constrains instance-level matching, the structure-aware distillation further regularizes the geometric consistency between learned matching representations and intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, which simultaneously enhances single-modal retrieval capabilities compared to the baseline models.
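A minimal relational-matching sketch of the structure-aware distillation described above, assuming cosine-similarity matrices as the "structure" and a mean-squared alignment term; the paper's actual multi-granularity loss may be defined differently:

```python
import torch
import torch.nn.functional as F

def relational_distillation_loss(matching_repr, intra_modal_repr):
    """Align the pairwise similarity structure of cross-modal matching representations
    with that of (detached) intra-modal representations.
    matching_repr, intra_modal_repr: (B, D) feature batches for the same instances."""
    def relation(x):
        x = F.normalize(x, dim=-1)
        return x @ x.t()                                   # (B, B) cosine-similarity structure
    return F.mse_loss(relation(matching_repr), relation(intra_modal_repr).detach())
```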
{"title":"Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation","authors":"Yang Yang;Wenjuan Xi;Luping Zhou;Jinhui Tang","doi":"10.1109/TIP.2024.3518759","DOIUrl":"10.1109/TIP.2024.3518759","url":null,"abstract":"Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Actually, the assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently influenced when facing imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constraints instance-level matching, the structure-aware distillation further regularizes the geometric consistency between learned matching representations and intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6881-6892"},"PeriodicalIF":0.0,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142879658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Passive Non-Line-of-Sight Imaging With Light Transport Modulation
Jiarui Zhang;Ruixu Geng;Xiaolong Du;Yan Chen;Houqiang Li;Yang Hu
Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years, due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task since changing the conditions will lead to different imaging models. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at https://github.com/JerryOctopus/NLOS-LTM.
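One plausible form of the modulation described above is a FiLM-style scale-and-shift of feature maps conditioned on the latent light-transport code; this is an assumed mechanism shown for illustration, not necessarily the block used in NLOS-LTM:

```python
import torch.nn as nn

class LightTransportModulation(nn.Module):
    """Modulate reconstruction-network features with a latent light-transport code
    via per-channel scale and shift (FiLM-style; an illustrative assumption)."""
    def __init__(self, code_dim, channels):
        super().__init__()
        self.to_scale = nn.Linear(code_dim, channels)
        self.to_shift = nn.Linear(code_dim, channels)

    def forward(self, feat, code):                 # feat: (B, C, H, W), code: (B, code_dim)
        scale = self.to_scale(code)[:, :, None, None]
        shift = self.to_shift(code)[:, :, None, None]
        return feat * (1.0 + scale) + shift
```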
{"title":"Passive Non-Line-of-Sight Imaging With Light Transport Modulation","authors":"Jiarui Zhang;Ruixu Geng;Xiaolong Du;Yan Chen;Houqiang Li;Yang Hu","doi":"10.1109/TIP.2024.3518097","DOIUrl":"10.1109/TIP.2024.3518097","url":null,"abstract":"Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years, due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task since changing the conditions will lead to different imaging models. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at <uri>https://github.com/JerryOctopus/NLOS-LTM</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"410-424"},"PeriodicalIF":0.0,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142879933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Ultra-Low Bitrate Face Video Compression Based on Conversions From 3D Keypoints to 2D Motion Map
Zhao Wang;Bolin Chen;Shurun Wang;Shiqi Wang;Yan Ye;Siwei Ma
Face video compression is a crucial problem for a range of online applications, such as video chat/conferencing, live broadcasting, and remote education. Compared to other natural videos, these face-centric videos, which contain abundant structural information, can be compactly represented and reconstructed at high quality via deep generative models, so that promising compression performance can be achieved. However, existing generative face video compression schemes face an inconsistency between the 3D facial motion in the physical world and the face content evolution in the 2D view. To address this drawback, we propose a 3D-keypoint-and-2D-motion based generative method for face video compression, namely FVC-3K2M, which ensures perceptual compensation and visual consistency between motion description and face reconstruction. In particular, the temporal evolution of face video can be characterized by separate 3D keypoints from global and local perspectives, providing great coding flexibility and accurate motion representation. Moreover, a cascade motion conversion mechanism is proposed to internally convert 3D keypoints into 2D dense motion, making the reconstructed face video perceptually realistic. Finally, an adaptive reference frame selection scheme is developed to enhance adaptation to various temporal movements. Experimental results show that the proposed scheme can realize reliable video communication at extremely limited bandwidth, e.g., 2 kbps. Extensive comparisons with state-of-the-art video coding standards and the latest face video compression methods demonstrate that our scheme achieves superior compression performance across multiple quality evaluations.
{"title":"Ultra-Low Bitrate Face Video Compression Based on Conversions From 3D Keypoints to 2D Motion Map","authors":"Zhao Wang;Bolin Chen;Shurun Wang;Shiqi Wang;Yan Ye;Siwei Ma","doi":"10.1109/TIP.2024.3518100","DOIUrl":"10.1109/TIP.2024.3518100","url":null,"abstract":"How to compress face video is a crucial problem for a series of online applications, such as video chat/conference, live broadcasting and remote education. Compared to other natural videos, these face-centric videos owning abundant structural information can be compactly represented and high-quality reconstructed via deep generative models, such that the promising compression performance can be achieved. However, the existing generative face video compression schemes are faced with the inconsistency between the 3D facial motion in the physical world and the face content evolution in the 2D view. To solve this drawback, we propose a 3D-Keypoint-and-2D-Motion based generative method for Face Video Compression, namely FVC-3K2M, which can well ensure perceptual compensation and visual consistency between motion description and face reconstruction. In particular, the temporal evolution of face video can be characterized into separate 3D keypoints from the global and local perspectives, entailing great coding flexibility and accurate motion representation. Moreover, a cascade motion conversion mechanism is further proposed to internally convert 3D keypoints to 2D dense motion, enforcing the face video reconstruction to be perceptually realistic. Finally, an adaptive reference frame selection scheme is developed to enhance the adaptation of various temporal movements. Experimental results show that the proposed scheme can realize reliable video communication in the extremely limited bandwidth, e.g., 2 kbps. Compared to the state-of-the-art video coding standards and the latest face video compression methods, extensive comparisons demonstrate that our proposed scheme achieves superior compression performance in terms of multiple quality evaluations.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6850-6864"},"PeriodicalIF":0.0,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142867126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Modeling Dual-Exposure Quad-Bayer Patterns for Joint Denoising and Deblurring
Yuzhi Zhao;Lai-Man Po;Xin Ye;Yongzhe Xu;Qiong Yan
Image degradation caused by noise and blur remains a persistent challenge in imaging systems, stemming from limitations in both hardware and methodology. Single-image solutions face an inherent tradeoff between noise reduction and motion blur. While short exposures can capture clear motion, they suffer from noise amplification. Long exposures reduce noise but introduce blur. Learning-based single-image enhancers tend to produce over-smoothed results due to the limited information available. Multi-image solutions using burst mode avoid this tradeoff by capturing more spatial-temporal information but often struggle with misalignment caused by camera/scene motion. To address these limitations, we propose a physical-model-based image restoration approach leveraging a novel dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long exposures at the same starting point but with varying durations, this method integrates complementary noise-blur information within a single image. We further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data from Bayer patterns to facilitate training. Based on this dual-exposure sensor model, we design a hierarchical convolutional neural network called QRNet to recover high-quality RGB images. The network incorporates input enhancement blocks and multi-level feature extraction to improve restoration quality. Experiments demonstrate superior performance over state-of-the-art deblurring and denoising methods on both synthetic and real-world datasets. The code, model, and datasets are publicly available at https://github.com/zhaoyuzhi/QRNet.
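For intuition only, a toy sketch of building a dual-exposure mosaic from two registered raw captures; the row assignment and all details of the paper's B2QB synthesis are assumptions here:

```python
import numpy as np

def synthesize_dual_exposure_mosaic(short_raw, long_raw):
    """Interleave short- and long-exposure raw frames so that, within each 2x2 block
    of same-color Quad-Bayer pixels, one row is short-exposed and the other long-exposed.
    Both inputs are (H, W) arrays assumed to be pixel-aligned, with quads starting at row 0."""
    out = long_raw.astype(np.float32).copy()
    out[0::2, :] = short_raw[0::2, :]     # even rows take the short exposure
    return out
```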
{"title":"Modeling Dual-Exposure Quad-Bayer Patterns for Joint Denoising and Deblurring","authors":"Yuzhi Zhao;Lai-Man Po;Xin Ye;Yongzhe Xu;Qiong Yan","doi":"10.1109/TIP.2024.3515873","DOIUrl":"10.1109/TIP.2024.3515873","url":null,"abstract":"Image degradation caused by noise and blur remains a persistent challenge in imaging systems, stemming from limitations in both hardware and methodology. Single-image solutions face an inherent tradeoff between noise reduction and motion blur. While short exposures can capture clear motion, they suffer from noise amplification. Long exposures reduce noise but introduce blur. Learning-based single-image enhancers tend to be over-smooth due to the limited information. Multi-image solutions using burst mode avoid this tradeoff by capturing more spatial-temporal information but often struggle with misalignment from camera/scene motion. To address these limitations, we propose a physical-model-based image restoration approach leveraging a novel dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long exposures at the same starting point but with varying durations, this method integrates complementary noise-blur information within a single image. We further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data from Bayer patterns to facilitate training. Based on this dual-exposure sensor model, we design a hierarchical convolutional neural network called QRNet to recover high-quality RGB images. The network incorporates input enhancement blocks and multi-level feature extraction to improve restoration quality. Experiments demonstrate superior performance over state-of-the-art deblurring and denoising methods on both synthetic and real-world datasets. The code, model, and datasets are publicly available at <uri>https://github.com/zhaoyuzhi/QRNet</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"350-364"},"PeriodicalIF":0.0,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142867124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0