Difference-Complementary Learning and Label Reassignment for Multimodal Semi-Supervised Semantic Segmentation of Remote Sensing Images
Pub Date: 2025-01-10 | DOI: 10.1109/TIP.2025.3526064 | IEEE Transactions on Image Processing, vol. 34, pp. 566-580
Wenqi Han;Wen Jiang;Jie Geng;Wang Miao
The feature fusion of optical and Synthetic Aperture Radar (SAR) images is widely used for semantic segmentation of multimodal remote sensing images, as it leverages information from two different sensors to enhance land-cover analysis. However, the imaging characteristics of optical and SAR data differ greatly, and noise interference makes the fusion of multimodal information challenging. Furthermore, in practical remote sensing applications, only a limited number of labeled samples are typically available, while most pixels remain unlabeled. Semi-supervised learning has the potential to improve model performance in scenarios with limited labeled data. However, in remote sensing applications, the quality of pseudo-labels is frequently compromised, particularly in challenging regions such as blurred edges and areas with class confusion, and this degradation in label quality can harm the model's overall performance. In this paper, we introduce the Difference-Complementary Learning and Label Reassignment (DLLR) network for multimodal semi-supervised semantic segmentation of remote sensing images. The proposed DLLR framework leverages asymmetric masking to create information discrepancies between the optical and SAR modalities and employs a difference-guided complementary learning strategy to enable mutual learning. Subsequently, we introduce a multi-level label reassignment strategy that treats label assignment as an optimal transport optimization problem, allocating unlabeled pixels to classes with higher precision and thereby improving the quality of pseudo-label annotations. Finally, we introduce a multimodal consistency cross pseudo-supervision strategy to improve pseudo-label utilization. We evaluate our method on two multimodal remote sensing datasets, WHU-OPT-SAR and EErDS-OPT-SAR. Experimental results demonstrate that the proposed DLLR model outperforms other relevant deep networks in multimodal semantic segmentation accuracy.
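The optimal transport view of pseudo-label reassignment can be made concrete with a minimal sketch (not the authors' implementation): pixel-to-class assignment is cast as entropy-regularized optimal transport and solved with a few Sinkhorn iterations. The class-prior marginal, the regularization strength, and all shapes below are illustrative assumptions.

    import numpy as np

    def sinkhorn_reassign(probs, class_prior, eps=0.05, n_iter=50):
        # probs: (N, K) softmax scores of N unlabeled pixels over K classes.
        # class_prior: (K,) desired class proportions (an assumed marginal).
        cost = -np.log(probs + 1e-8)                         # per pixel/class transport cost
        kernel = np.exp(-cost / eps)                         # Gibbs kernel
        a = np.full(probs.shape[0], 1.0 / probs.shape[0])    # uniform pixel marginal
        u = np.ones_like(a)
        for _ in range(n_iter):                              # Sinkhorn scaling updates
            v = class_prior / (kernel.T @ u)
            u = a / (kernel @ v)
        plan = u[:, None] * kernel * v[None, :]              # transport plan, shape (N, K)
        return plan.argmax(axis=1)                           # reassigned pseudo-labels

    # toy usage: 6 unlabeled pixels, 3 classes
    rng = np.random.default_rng(0)
    scores = rng.dirichlet(np.ones(3), size=6)
    labels = sinkhorn_reassign(scores, class_prior=np.array([0.5, 0.3, 0.2]))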
Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition
Pub Date: 2025-01-10 | DOI: 10.1109/TIP.2024.3523799 | IEEE Transactions on Image Processing, vol. 34, pp. 717-728
Fanfu Xue;Jiande Sun;Yaqi Xue;Qiang Wu;Lei Zhu;Xiaojun Chang;Sen-Ching Cheung
Despite recent advances, scene text recognition remains a challenging problem due to the significant variability, irregularity, and distortion in text appearance and localization. Attention-based methods have become mainstream due to their superior vocabulary learning and observation ability. Nonetheless, they are susceptible to attention drift, which can lead to word recognition errors. Most works focus on correcting attention drift during decoding but ignore the error accumulated during the encoding process. In this paper, we propose a novel scheme, Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition (ACDS-STR), which mitigates attention drift at the feature encoding stage. At the heart of the proposed scheme is the cross-domain attention guidance and feature encoding fusion module (CAFM), which uses the core areas of characters to recursively guide attention during encoding. With the precise attention information sourced from CAFM, we propose a non-attention-based adaptive transformation decoder (ATD) to guarantee decoding performance and improve decoding speed. In the training stage, we fuse manual guidance and subjective learning to learn the core areas of characters, which notably improves the recognition performance of the model. Experiments conducted on public benchmarks show state-of-the-art performance. The source code will be available at https://github.com/xuefanfu/ACDS-STR.
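As a rough illustration of how encoder attention can be guided by character core areas, the sketch below supervises a flattened attention map with a binary core-region mask through a KL-divergence term; the shapes and the specific loss form are assumptions, not the paper's exact CAFM design.

    import torch
    import torch.nn.functional as F

    def attention_guidance_loss(attn_logits, core_mask):
        # attn_logits: (B, H*W) raw attention scores from the encoder.
        # core_mask:   (B, H*W) binary mask marking character core areas.
        log_attn = F.log_softmax(attn_logits, dim=-1)
        target = core_mask / core_mask.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return F.kl_div(log_attn, target, reduction="batchmean")

    # toy usage
    B, HW = 2, 64
    loss = attention_guidance_loss(torch.randn(B, HW), (torch.rand(B, HW) > 0.8).float())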
NeuralDiffuser: Neuroscience-Inspired Diffusion Guidance for fMRI Visual Reconstruction
Pub Date: 2025-01-10 | DOI: 10.1109/TIP.2025.3526051 | IEEE Transactions on Image Processing, vol. 34, pp. 552-565
Haoyu Li;Hao Wu;Badong Chen
Reconstructing visual stimuli from functional Magnetic Resonance Imaging (fMRI) enables fine-grained retrieval of brain activity. However, accurately reconstructing diverse details, including structure, background, texture, color, and more, remains challenging. Stable diffusion models inevitably introduce variability into the reconstructed images, even under identical conditions. To address this challenge, we first examine diffusion methods from a neuroscientific perspective: they primarily perform top-down creation using knowledge pre-trained on extensive image datasets but tend to lack detail-driven bottom-up perception, leading to a loss of faithful details. In this paper, we propose NeuralDiffuser, which incorporates primary visual feature guidance to provide detailed cues in the form of gradients. This extension of the bottom-up process for diffusion models achieves both semantic coherence and detail fidelity when reconstructing visual stimuli. Furthermore, we develop a novel guidance strategy for reconstruction tasks that ensures the consistency of repeated outputs with the original images rather than with various other outputs. Extensive experimental results on the Natural Scenes Dataset (NSD) qualitatively and quantitatively demonstrate the advantages of NeuralDiffuser through comparisons with baseline and state-of-the-art methods, as well as through ablation studies. Code is available at https://github.com/HaoyyLi/NeuralDiffuser.
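The idea of injecting detail cues as gradients into the denoising process can be sketched as a classifier-guidance-style correction of the predicted noise. In the schematic below, feature_net, target_feat, and the scaling factor are placeholders for whatever primary visual features are decoded from fMRI; this is an assumed sketch, not the released NeuralDiffuser code.

    import torch
    import torch.nn.functional as F

    def guided_noise(x_t, eps_pred, alpha_bar_t, feature_net, target_feat, scale=1.0):
        # x_t: noisy sample at step t; eps_pred: the diffusion model's noise estimate.
        x_t = x_t.detach().requires_grad_(True)
        # predicted clean image from the current noisy sample (standard DDPM identity)
        x0_hat = (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
        loss = F.mse_loss(feature_net(x0_hat), target_feat)   # detail-level feature matching
        grad = torch.autograd.grad(loss, x_t)[0]
        return eps_pred + scale * torch.sqrt(1 - alpha_bar_t) * grad  # guided noise estimate

    # toy usage with a stand-in "feature extractor"
    feat = lambda x: F.avg_pool2d(x, 4)
    eps = guided_noise(torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32),
                       torch.tensor(0.5), feat, torch.zeros(1, 3, 8, 8))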
Physically Realizable Adversarial Creating Attack Against Vision-Based BEV Space 3D Object Detection
Pub Date: 2025-01-10 | DOI: 10.1109/TIP.2025.3526056 | IEEE Transactions on Image Processing, vol. 34, pp. 538-551
Jian Wang;Fan Li;Song Lv;Lijun He;Chao Shen
Vision-based 3D object detection, a cost-effective alternative to LiDAR-based solutions, plays a crucial role in modern autonomous driving systems. Meanwhile, deep models have been proven susceptible to adversarial examples, and attacks on detection models can lead to serious driving consequences. Most previous adversarial attacks targeted 2D detectors by placing a patch in a specific region within the object's bounding box in the image, allowing the object to evade detection. However, attacking a 3D detector is more difficult because the adversary may be observed from different viewpoints and distances, and effective methods for differentiably rendering a poster in 3D space onto the image have been lacking. In this paper, we propose a novel attack setting in which a carefully crafted adversarial poster (which looks like meaningless graffiti) is learned and pasted on the road surface, inducing vision-based 3D detectors to perceive a non-existent object. We show that even a single 2D poster is sufficient to deceive the 3D detector with the desired attack effect, and that the poster is universal, remaining effective across various scenes, viewpoints, and distances. To generate the poster, an image-3D applying algorithm is devised to establish the pixel-wise mapping between the image area and the poster in 3D space, so that the poster can be optimized through standard backpropagation. Moreover, a ground-truth masked optimization strategy is presented to learn the poster effectively without interference from scene objects. Extensive results, including real-world experiments, validate the effectiveness of our adversarial attack. The transferability and defense strategies are also investigated to comprehensively understand the proposed attack.
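To make the pixel-wise image-3D mapping idea concrete, here is a simplified, hypothetical sketch: the road-surface poster is treated as a planar texture, mapped into the camera image with an assumed ground-plane homography, and composited differentiably so the texture can be optimized by backpropagation. The homography, shapes, and compositing are illustrative assumptions, not the authors' algorithm.

    import torch
    import torch.nn.functional as F

    def render_poster(image, poster, H_ground_to_img):
        # image: (3, H, W) scene image; poster: (3, hp, wp) learnable texture;
        # H_ground_to_img: (3, 3) assumed homography from the poster plane to the image.
        _, H, W = image.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()     # (H, W, 3)
        ground = pix @ torch.linalg.inv(H_ground_to_img).T                   # back-project pixels
        uv = ground[..., :2] / ground[..., 2:3]                              # poster-plane coords
        grid = torch.stack([uv[..., 0] / poster.shape[2] * 2 - 1,            # normalize to [-1, 1]
                            uv[..., 1] / poster.shape[1] * 2 - 1], dim=-1)
        sampled = F.grid_sample(poster[None], grid[None], align_corners=False)[0]
        inside = (grid.abs() <= 1).all(dim=-1).float()[None]                 # pixels hit by the poster
        return image * (1 - inside) + sampled * inside

    # toy usage; the poster texture would be optimized through this rendering by backpropagation
    img, poster = torch.rand(3, 64, 96), torch.rand(3, 32, 32, requires_grad=True)
    out = render_poster(img, poster, torch.eye(3))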
3VL: Using Trees to Improve Vision-Language Models’ Interpretability
Pub Date: 2025-01-06 | DOI: 10.1109/TIP.2024.3523801 | IEEE Transactions on Image Processing, vol. 34, pp. 495-509
Nir Yellinek;Leonid Karlinsky;Raja Giryes
Vision-Language models (VLMs) have proven effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects’ attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL induces this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model’s success or failure.
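As a conceptual illustration of DiRe's differential comparison, the sketch below subtracts the relevancy map produced with a perturbed (negative) caption from the map produced with the correct caption and keeps the positive residue for visualization. How the individual relevancy maps are obtained (e.g., gradient-based attribution over image patches) is left abstract; everything here is an assumed stand-in rather than the paper's tool.

    import numpy as np

    def differential_relevance(rel_pos, rel_neg):
        # rel_pos, rel_neg: (H, W) relevancy maps for the correct / perturbed caption.
        dire = np.clip(rel_pos - rel_neg, 0, None)   # keep regions favoring the correct text
        return dire / (dire.max() + 1e-8)            # normalize for visualization

    # toy usage with random 14x14 patch-level maps
    rng = np.random.default_rng(0)
    heatmap = differential_relevance(rng.random((14, 14)), rng.random((14, 14)))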
Diffusion Model-Based Visual Compensation Guidance and Visual Difference Analysis for No-Reference Image Quality Assessment
Pub Date: 2025-01-06 | DOI: 10.1109/TIP.2024.3523800 | IEEE Transactions on Image Processing, vol. 34, pp. 263-278
Zhaoyang Wang;Bo Hu;Mingyang Zhang;Jie Li;Leida Li;Maoguo Gong;Xinbo Gao
Existing free-energy guided No-Reference Image Quality Assessment (NR-IQA) methods continue to face challenges in effectively restoring complexly distorted images. The features guiding the main network for quality assessment lack interpretability, and efficiently leveraging high-level feature information remains a significant challenge. As a novel class of state-of-the-art (SOTA) generative models, diffusion models can capture intricate relationships, enhancing the effectiveness of image restoration. Moreover, the intermediate variables in the denoising iteration process carry clearer and more interpretable meanings for high-level visual information guidance. In view of this, we pioneer the exploration of diffusion models in the domain of NR-IQA. We design a novel diffusion model for enhancing images with various types of distortions, which yields higher-quality and more interpretable high-level visual information. Our experiments demonstrate that the diffusion model establishes a clear mapping between image reconstruction and image quality scores, which the network learns to use for guiding quality assessment. Finally, to fully leverage high-level visual information, we design two complementary visual branches that collaboratively perform quality evaluation. Extensive experiments are conducted on seven public NR-IQA datasets, and the results demonstrate that the proposed model outperforms SOTA NR-IQA methods. The code will be available at https://github.com/handsomewzy/DiffV2IQA.
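A minimal sketch of how two complementary visual branches might be combined for score regression is shown below: features of the distorted input and of its diffusion-restored counterpart are fused together with their difference and mapped to a quality score. The dimensions and head architecture are assumptions, not the paper's exact design.

    import torch
    import torch.nn as nn

    class TwoBranchIQAHead(nn.Module):
        def __init__(self, feat_dim=512):
            super().__init__()
            self.regressor = nn.Sequential(
                nn.Linear(feat_dim * 3, 256), nn.ReLU(), nn.Linear(256, 1))

        def forward(self, feat_distorted, feat_restored):
            diff = feat_distorted - feat_restored                      # visual difference cue
            fused = torch.cat([feat_distorted, feat_restored, diff], dim=-1)
            return self.regressor(fused).squeeze(-1)                   # predicted quality score

    # toy usage with a batch of 4 feature vectors
    head = TwoBranchIQAHead()
    score = head(torch.randn(4, 512), torch.randn(4, 512))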
IEEE Transactions on Image Processing publication information
Pub Date: 2025-01-06 | DOI: 10.1109/TIP.2024.3460568 | IEEE Transactions on Image Processing, vol. 33, pp. C2-C2
Universal Fine-Grained Visual Categorization by Concept Guided Learning
Pub Date: 2025-01-06 | DOI: 10.1109/TIP.2024.3523802 | IEEE Transactions on Image Processing, vol. 34, pp. 394-409
Qi Bi;Beichen Zhou;Wei Ji;Gui-Song Xia
Existing fine-grained visual categorization (FGVC) methods assume that the fine-grained semantics rest in the informative parts of an image. This assumption works well on favorable front-view object-centric images but faces great challenges in many real-world scenarios, such as scene-centric images (e.g., street view) and adverse viewpoints (e.g., object re-identification, remote sensing). In such scenarios, mis- or over-activation of features is likely to confuse part selection and degrade the fine-grained representation. In this paper, we design a universal FGVC framework for real-world scenarios. More precisely, we propose concept guided learning (CGL), which models the concepts of a fine-grained category as a combination of concepts inherited from its subordinate coarse-grained category and discriminative concepts of its own. The discriminative concepts are utilized to guide fine-grained representation learning. Specifically, three key steps are designed, namely, concept mining, concept fusion, and concept constraint. In addition, to bridge the FGVC dataset gap under scene-centric and adverse-viewpoint scenarios, we propose a Fine-grained Land-cover Categorization Dataset (FGLCD) with 59,994 fine-grained samples. Extensive experiments show that the proposed CGL: 1) achieves competitive performance on conventional FGVC; 2) achieves state-of-the-art performance on fine-grained aerial scenes and scene-centric street scenes; 3) generalizes well to object re-identification and fine-grained aerial object detection. The dataset and source code will be available at https://github.com/BiQiWHU/CGL.
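The decomposition of a fine-grained concept into an inherited coarse concept plus a discriminative concept of its own can be sketched as a toy classification head. All dimensions, the fusion by addition, and the cosine-similarity scoring below are illustrative assumptions, not the CGL architecture itself.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConceptGuidedHead(nn.Module):
        def __init__(self, num_fine, num_coarse, fine_to_coarse, dim=256):
            super().__init__()
            self.coarse_concepts = nn.Parameter(torch.randn(num_coarse, dim))
            self.discr_concepts = nn.Parameter(torch.randn(num_fine, dim))
            self.register_buffer("f2c", torch.as_tensor(fine_to_coarse))

        def forward(self, feats):                                  # feats: (B, dim)
            inherited = self.coarse_concepts[self.f2c]             # (num_fine, dim)
            fused = F.normalize(inherited + self.discr_concepts, dim=-1)
            return F.normalize(feats, dim=-1) @ fused.T            # (B, num_fine) logits

    # toy usage: 5 fine-grained classes grouped under 2 coarse classes
    head = ConceptGuidedHead(num_fine=5, num_coarse=2, fine_to_coarse=[0, 0, 1, 1, 1])
    logits = head(torch.randn(3, 256))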
Constrained Visual Representation Learning With Bisimulation Metrics for Safe Reinforcement Learning
Pub Date: 2025-01-06 | DOI: 10.1109/TIP.2024.3523798 | IEEE Transactions on Image Processing, vol. 34, pp. 379-393
Rongrong Wang;Yuhu Cheng;Xuesong Wang
Safe reinforcement learning aims to ensure optimal performance while minimizing potential risks. In real-world applications, especially in scenarios that rely on visual inputs, a key challenge lies in extracting the essential features for safe decision-making while maintaining sample efficiency. To address this issue, we propose constrained visual representation learning with bisimulation metrics for safe reinforcement learning (CVRL-BM). CVRL-BM constructs a sequential conditional variational inference model to compress high-dimensional visual observations into low-dimensional state representations. Additionally, safety bisimulation metrics are introduced to quantify the behavioral similarity between states, and our objective is to make the distance between any two latent state representations as close as possible to the safety bisimulation metric between their corresponding states. By integrating these two components, CVRL-BM learns compact, information-rich visual state representations while satisfying predefined safety constraints. Experiments on Safety Gym show that CVRL-BM outperforms existing vision-based safe reinforcement learning methods in both safety and efficacy. In particular, CVRL-BM surpasses the state-of-the-art Safe SLAC method with a 19.748% higher reward return, a 41.772% lower cost return, and a 5.027% decrease in cost regret. These results highlight the effectiveness of the proposed CVRL-BM.
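A simplified sketch of a bisimulation-metric representation loss adapted to a safety setting is given below: latent distances between randomly paired states are regressed toward a target built from reward and cost differences plus a discounted distance between predicted next-state latents. The exact metric and model components in CVRL-BM are richer; everything here is an assumption for illustration.

    import torch
    import torch.nn.functional as F

    def safety_bisim_loss(z, reward, cost, z_next_pred, gamma=0.99):
        # z, z_next_pred: (B, D) latent states; reward, cost: (B,) per-state scalars.
        perm = torch.randperm(z.size(0))                        # compare random state pairs
        z_dist = (z - z[perm]).norm(dim=-1)
        r_dist = (reward - reward[perm]).abs() + (cost - cost[perm]).abs()
        next_dist = (z_next_pred - z_next_pred[perm]).norm(dim=-1)
        target = r_dist + gamma * next_dist
        return F.mse_loss(z_dist, target.detach())

    # toy usage with a batch of 8 latent states
    loss = safety_bisim_loss(torch.randn(8, 32), torch.rand(8), torch.rand(8), torch.randn(8, 32))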
Exploiting Latent Properties to Optimize Neural Codecs
Pub Date: 2025-01-01 | DOI: 10.1109/TIP.2024.3522813 | IEEE Transactions on Image Processing, vol. 34, pp. 306-319
Muhammet Balcilar;Bharath Bhushan Damodaran;Karam Naser;Franck Galpin;Pierre Hellier
End-to-end image and video codecs are becoming increasingly competitive with traditional compression techniques that have been developed through decades of manual engineering effort. These trainable codecs have many advantages over traditional techniques, such as straightforward adaptation to perceptual distortion metrics and high performance in specific fields thanks to their learning ability. However, current state-of-the-art neural codecs do not fully exploit the benefits of vector quantization or the entropy gradient available in decoding devices. In this paper, we propose to leverage these two properties (vector quantization and the entropy gradient) to improve the performance of off-the-shelf codecs. First, we demonstrate that non-uniform scalar quantization cannot improve performance over uniform quantization; we therefore suggest using predefined optimal uniform vector quantization. Second, we show that the entropy gradient, available at the decoder, is correlated with the reconstruction error gradient, which is not available at the decoder, and we use the former as a proxy to enhance compression performance. Our experimental results show that these approaches save between 1% and 3% of the rate at the same quality across various pre-trained methods. In addition, the entropy-gradient-based solution significantly improves the performance of traditional codecs as well.
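The decoder-side use of the entropy gradient can be sketched as a single refinement step: because the gradient of the rate with respect to the latent correlates with the unavailable distortion gradient, the dequantized latent is nudged along it before the synthesis transform. The step size and the stand-in entropy model below are placeholders, not the paper's configuration.

    import torch

    def refine_latent_with_entropy_grad(y_hat, likelihood_fn, step=0.1):
        # y_hat: dequantized latent; likelihood_fn(y) -> per-element likelihoods.
        y = y_hat.detach().requires_grad_(True)
        rate = -torch.log2(likelihood_fn(y) + 1e-9).sum()       # estimated bits for y
        grad = torch.autograd.grad(rate, y)[0]                  # entropy gradient (proxy signal)
        return (y - step * grad).detach()                       # refined latent fed to the decoder

    # toy usage with a factorized Gaussian as a stand-in entropy model
    gauss = torch.distributions.Normal(0.0, 1.0)
    y_refined = refine_latent_with_entropy_grad(torch.randn(1, 8, 4, 4),
                                                lambda y: gauss.log_prob(y).exp())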