IEEE transactions on image processing : a publication of the IEEE Signal Processing Society - Latest Articles (page 7)

Real-World Low-Dose CT Image Denoising by Patch Similarity Purification

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-17 DOI: 10.1109/TIP.2024.3515878

Zeya Song;Liqi Xue;Jun Xu;Baoping Zhang;Chao Jin;Jian Yang;Changliang Zou

Reducing the radiation dose in CT scanning is important for limiting harm to patients in clinical practice. A promising approach is to replace normal-dose CT (NDCT) imaging with low-dose CT (LDCT) imaging, which uses a lower tube voltage and tube current. This often introduces severe noise into the LDCT images, which adversely affects diagnostic accuracy. Most existing LDCT image denoising networks are trained either on synthetic LDCT images or on real-world LDCT and NDCT image pairs with large spatial misalignment. However, synthetic noise differs substantially from the complex noise in real-world LDCT images, while large spatial misalignment leads to inaccurate predictions of tissue structures in the denoised LDCT images. To make better use of real-world LDCT and NDCT image pairs for LDCT image denoising, in this paper we introduce a new Patch Similarity Purification (PSP) strategy to construct a high-quality training dataset. Specifically, our PSP strategy first binarizes each pair of image patches cropped from corresponding LDCT and NDCT images. For each pair of binary masks, it then computes a similarity ratio by common mask calculation, and the patch pair is selected as a training sample if its mask similarity ratio is higher than a threshold. With our PSP strategy, each training set of our Rabbit and Patient datasets contains hundreds of thousands of real-world LDCT and NDCT image patch pairs with negligible misalignment. Extensive experiments demonstrate the usefulness of our PSP strategy for purifying the training data and the effectiveness of training LDCT image denoising networks on our datasets. The code and dataset are provided at https://github.com/TuTusong/PSP.
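The patch-selection step described in this abstract can be illustrated with a minimal Python sketch. The abstract does not define the binarization rule or the exact "common mask" ratio, so the fixed intensity level, the intersection-over-union ratio, and the names `binarize`, `mask_similarity`, and `keep_pair` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence

Patch = Sequence[Sequence[float]]
Mask = List[List[bool]]


def binarize(patch: Patch, level: float) -> Mask:
    """Return a mask that is True where the pixel value exceeds `level`."""
    return [[value > level for value in row] for row in patch]


def mask_similarity(mask_a: Mask, mask_b: Mask) -> float:
    """Pixels set in both masks divided by pixels set in either mask.

    This intersection-over-union ratio is an assumed stand-in for the
    paper's "common mask calculation".
    """
    common = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            common += a and b
            union += a or b
    # Two empty masks are treated as identical.
    return common / union if union else 1.0


def keep_pair(ldct: Patch, ndct: Patch, level: float = 0.5,
              min_ratio: float = 0.9) -> bool:
    """Keep a patch pair only if its mask similarity exceeds `min_ratio`."""
    ratio = mask_similarity(binarize(ldct, level), binarize(ndct, level))
    return ratio > min_ratio
```

A pair whose bright structures occupy the same pixels passes the filter, while a pair whose structures are shifted relative to each other is discarded; the default `level` and `min_ratio` values here are placeholders rather than values from the paper.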

Volume: 34, Pages: 196-208

Citations: 0

Key-Axis-Based Localization of Symmetry Axes in 3D Objects Utilizing Geometry and Texture

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-17 DOI: 10.1109/TIP.2024.3515801

Yulin Wang;Chen Luo

In pose estimation for objects with rotational symmetry, ambiguous poses may arise, and the symmetry axes of objects are crucial for eliminating such ambiguities. Current pose estimation pipelines rely on manually set symmetry axes, which decreases pose estimation accuracy. To address this issue, the proposed method determines the orders of the symmetry axes and the angles between them from a given rotational symmetry type or polyhedron, reducing the need for manual settings. Two key axes with the highest orders are then defined and localized, three orthogonal axes are generated from the key axes, and each symmetry axis is computed from the orthogonal axes. Compared with localizing symmetry axes one by one, key-axis-based localization is more efficient. To support both geometric and texture symmetry, the method uses the ADI metric for key axis localization in geometrically symmetric objects and proposes a novel metric, ADI-C, for objects with texture symmetry. Experimental results on the LM-O and HB datasets demonstrate a 9.80% reduction in symmetry axis localization error and a 1.64% improvement in pose estimation accuracy. Additionally, the method introduces a new dataset, DSRSTO, to illustrate its performance across seven types of geometrically and texturally symmetric objects. The open-source tool based on this method is available at https://github.com/WangYuLin-SEU/KASAL.
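The step that turns two key axes into three orthogonal axes can be sketched as follows. The abstract does not give the exact construction, so this uses a standard cross-product (Gram-Schmidt style) frame as an assumed stand-in; the function name `orthogonal_frame` and its helpers are hypothetical and not taken from the KASAL tool.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def dot(u: Vec, v: Vec) -> float:
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def cross(u: Vec, v: Vec) -> Vec:
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def normalize(u: Vec) -> Vec:
    norm = math.sqrt(dot(u, u))
    if norm < 1e-12:
        raise ValueError("cannot normalize a zero-length vector")
    return (u[0] / norm, u[1] / norm, u[2] / norm)


def orthogonal_frame(key_a: Vec, key_b: Vec) -> Tuple[Vec, Vec, Vec]:
    """Build a right-handed orthonormal frame from two non-parallel key axes.

    e1 follows the first key axis, e3 is perpendicular to both key axes,
    and e2 completes the frame. Parallel key axes raise ValueError.
    """
    e1 = normalize(key_a)
    e3 = normalize(cross(key_a, key_b))
    e2 = cross(e3, e1)
    return e1, e2, e3
```

Once such a frame is fixed, any remaining symmetry axis could be expressed as a known combination of `e1`, `e2`, and `e3` determined by the symmetry type, which is why only two axes would need to be localized directly.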

Volume: 33, Pages: 6720-6733

Citations: 0

Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-16 DOI: 10.1109/TIP.2024.3514352

Yu Lu;Ruijie Quan;Linchao Zhu;Yi Yang

Large-scale pre-trained vision-language models (e.g., CLIP) have shown incredible generalization performance in downstream tasks such as video-text retrieval (VTR). Traditional approaches have leveraged CLIP’s robust multi-modal alignment ability for VTR by directly fine-tuning vision and text encoders with clean video-text data. Yet, these techniques rely on carefully annotated video-text pairs, which are expensive and require significant manual effort. In this context, we introduce a new approach, Pseudo-Supervised Selective Contrastive Learning (PS-SCL). PS-SCL minimizes the dependency on manually-labeled text annotations by generating pseudo-supervisions from unlabeled video data for training. We first exploit CLIP’s visual recognition capabilities to generate pseudo-texts automatically. These pseudo-texts contain diverse visual concepts from the video and serve as weak textual guidance. Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated video-text pairs from pseudo-supervised video-text pairs. By doing so, SeLeCT enables more effective multi-modal learning under weak pairing supervision. Experimental results demonstrate that our method outperforms CLIP zero-shot performance by a large margin on multiple video-text retrieval benchmarks, e.g., 8.2% R@1 for video-to-text on MSRVTT, 12.2% R@1 for video-to-text on DiDeMo, and 10.9% R@1 for video-to-text on ActivityNet, respectively.
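The idea of selecting highly correlated pairs from pseudo-supervised video-text pairs can be sketched in a few lines of Python. The abstract does not specify the selection criterion, so ranking by cosine similarity and keeping a fixed fraction is an assumption, and `cosine`, `select_pairs`, and `keep_fraction` are hypothetical names rather than part of PS-SCL or SeLeCT.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def select_pairs(video_embs: Sequence[Sequence[float]],
                 text_embs: Sequence[Sequence[float]],
                 keep_fraction: float = 0.5) -> List[int]:
    """Return indices of the pseudo-pairs with the highest similarity.

    Pair i couples video_embs[i] with its pseudo-text text_embs[i].
    The top `keep_fraction` of pairs (at least one) are kept, and the
    indices are returned in ascending order.
    """
    scores = [cosine(v, t) for v, t in zip(video_embs, text_embs)]
    n_keep = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

In a training loop, only the returned indices would contribute to the contrastive loss, so weakly matched pseudo-texts would not pull the video and text encoders toward each other.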

Volume: 33, Pages: 6748-6760

Citations: 0

Diffusion Models as Strong Adversaries

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-16 DOI: 10.1109/TIP.2024.3514361

Xuelong Dai;Yanjie Li;Mingxing Duan;Bin Xiao

Diffusion models have demonstrated their great ability to generate high-quality images for various tasks. With such a strong performance, diffusion models can potentially pose a severe threat to both humans and deep learning models, e.g., DNNs and MLLMs. However, their abilities as adversaries have not been well explored. Among different adversarial scenarios, the no-box adversarial attack is the most practical one, as it assumes that the attacker has no access to the training dataset or the target model. Existing works still require some data from the training dataset, which may not be feasible in real-world scenarios. In this paper, we investigate the adversarial capabilities of diffusion models by conducting no-box attacks solely using data generated by diffusion models. Specifically, our attack method generates a synthetic dataset using diffusion models to train a substitute model. We then employ a classification diffusion model to fine-tune the substitute model, considering model uncertainty and incorporating noise augmentation. Finally, we sample adversarial examples from the diffusion models using the average approximation over the diffusion substitute model with multiple inferences. Extensive experiments on the ImageNet dataset demonstrate that the proposed attack method achieves state-of-the-art performance in both no-box attack and black-box attack scenarios.

Volume: 33, Pages: 6734-6747

Citations: 0

Unsupervised Learning of Intrinsic Semantics With Diffusion Model for Person Re-Identification

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-16 DOI: 10.1109/TIP.2024.3514360

Xuefeng Tao;Jun Kong;Min Jiang;Ming Lu;Ajmal Mian

Unsupervised person re-identification (Re-ID) aims to learn semantic representations for person retrieval without using identity labels. Most existing methods generate fine-grained patch features to reduce noise in global feature clustering. However, these methods often compromise the discriminative semantic structure and overlook the semantic consistency between the patch and global features. To address these problems, we propose a Person Intrinsic Semantic Learning (PISL) framework with a diffusion model for unsupervised person Re-ID. First, we design the Spatial Diffusion Model (SDM), which performs a denoising diffusion process from noisy spatial transformer parameters to semantic parameters, enabling the sampling of patches with intrinsic semantic structure. Second, we propose the Semantic Controlled Diffusion (SCD) loss to guide the denoising direction of the diffusion model, facilitating the generation of semantic patches. Third, we propose the Patch Semantic Consistency (PSC) loss to capture semantic consistency between the patch and global features, refining the pseudo-labels of global features. Comprehensive experiments on three challenging datasets show that our method surpasses current unsupervised Re-ID methods. The source code will be publicly available at https://github.com/taoxuefong/Diffusion-reid.

Volume: 33, Pages: 6705-6719

Citations: 0

A Self-Adaptive Feature Extraction Method for Aerial-View Geo-Localization

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-12 DOI: 10.1109/TIP.2024.3513157

Jinliang Lin;Zhiming Luo;Dazhen Lin;Shaozi Li;Zhun Zhong

Cross-view geo-localization aims to match images of the same geographic location taken from different views, e.g., drone-view images and geo-referenced satellite-view images. Because UAV cameras shoot from different angles and heights, the scale of the same target building varies greatly across drone-view images. Meanwhile, geographic locations in the real world, such as towers and stadiums, differ in size and floor area, which also leads to scale variation of geographic targets in the images. However, existing methods mainly focus on extracting fine-grained information about the geographic targets or contextual information about the surrounding area, and they overlook features that are robust to scale changes as well as the importance of feature alignment. In this study, we argue that the key to this task is training a network to mine a representation that remains discriminative under scale variation. To this end, we design a novel end-to-end network called Self-Adaptive Feature Extraction Network (Safe-Net) to extract scale-invariant features in a self-adaptive manner. Safe-Net includes a global representation-guided feature alignment module and a saliency-guided feature partition module. The former applies an affine transformation guided by the global feature for adaptive feature alignment. The latter, without extra region annotations, computes the saliency distribution over different regions of the image and uses this saliency information to guide a self-adaptive partition of the feature map, learning a visual representation that is robust to scale variation. Experiments on two prevailing large-scale aerial-view geo-localization benchmarks, i.e., University-1652 and SUES-200, show that the proposed method achieves state-of-the-art results. In addition, Safe-Net has a significant scale-adaptive capability and can extract robust feature representations for query images with small target buildings. The source code of this study is available at https://github.com/AggMan96/Safe-Net.

Volume: 34, Pages: 126-139

Citations: 0

VDMUFusion: A Versatile Diffusion Model-Based Unsupervised Framework for Image Fusion

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-12 DOI: 10.1109/TIP.2024.3512365

Yu Shi;Yu Liu;Juan Cheng;Z. Jane Wang;Xun Chen

Image fusion integrates information from various source images of the same scene into a composite image, thereby benefiting perception, analysis, and understanding. Recently, diffusion models have demonstrated impressive generative capabilities in the field of computer vision, suggesting significant potential for application in image fusion. The forward process in diffusion models requires the gradual addition of noise to the original data. However, typical unsupervised image fusion tasks (e.g., infrared-visible, medical, and multi-exposure image fusion) lack ground truth images (corresponding to the original data in diffusion models), which prevents the direct application of diffusion models. To address this problem, we propose a versatile diffusion model-based unsupervised framework for image fusion, termed VDMUFusion. In the proposed method, we integrate the fusion problem into the diffusion sampling process by formulating image fusion as a weighted average process and establishing appropriate assumptions about the noise in the diffusion model. To simplify the training process, we propose a multi-task learning framework that replaces the original noise prediction network, allowing for simultaneous prediction of noise and fusion weights. Meanwhile, our method employs joint training across various fusion tasks, which significantly improves noise prediction accuracy and yields higher quality fused images compared to training on a single task. Extensive experimental results demonstrate that the proposed method delivers very competitive performance across various image fusion tasks. The code is available at https://github.com/yuliu316316/VDMUFusion.
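Two ingredients named in this abstract, fusion as a per-pixel weighted average and the gradual addition of noise in the forward process, can be sketched on flattened images. The forward step below is the standard DDPM closed form; how VDMUFusion couples it with the fusion weights is not stated in the abstract, and the names `fuse` and `forward_noise` are assumptions for illustration. In the paper the weights are predicted by a network, whereas here they are simply given.

```python
import math
import random
from typing import List, Sequence


def fuse(src_a: Sequence[float], src_b: Sequence[float],
         weights: Sequence[float]) -> List[float]:
    """Per-pixel weighted average: w * a + (1 - w) * b, with w in [0, 1]."""
    return [w * a + (1.0 - w) * b for a, b, w in zip(src_a, src_b, weights)]


def forward_noise(x0: Sequence[float], alpha_bar: float,
                  rng: random.Random) -> List[float]:
    """Standard DDPM forward step.

    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1),
    where alpha_bar in [0, 1] is the cumulative noise-schedule product.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]
```

With `alpha_bar` close to 1 the output stays near the input, and with `alpha_bar` close to 0 it is almost pure Gaussian noise, which is the "gradual addition of noise" that requires a clean target image in the first place.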

Volume : 34 Pages : 441-454

Citations: 0

Learning Lossless Compression for High Bit-Depth Volumetric Medical Image

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-12 DOI: 10.1109/TIP.2024.3513156

Kai Wang;Yuanchao Bai;Daxin Li;Deming Zhai;Junjun Jiang;Xianming Liu

Recent advances in learning-based methods have markedly enhanced the capabilities of image compression. However, these methods struggle with high bit-depth volumetric medical images, facing issues such as degraded performance, increased memory demand, and reduced processing speed. To address these challenges, this paper presents the Bit-Division based Lossless Volumetric Image Compression (BD-LVIC) framework, which is tailored for high bit-depth medical volume compression. The BD-LVIC framework skillfully divides the high bit-depth volume into two lower bit-depth segments: the Most Significant Bit-Volume (MSBV) and the Least Significant Bit-Volume (LSBV). The MSBV concentrates on the most significant bits of the volumetric medical image, capturing vital structural details in a compact manner. This reduction in complexity greatly improves compression efficiency using traditional codecs. Conversely, the LSBV deals with the least significant bits, which encapsulate intricate texture details. To compress this detailed information effectively, we introduce an effective learning-based compression model equipped with a Transformer-Based Feature Alignment Module, which exploits both intra-slice and inter-slice redundancies to accurately align features. Subsequently, a Parallel Autoregressive Coding Module merges these features to precisely estimate the probability distribution of the least significant bit-planes. Our extensive testing demonstrates that the BD-LVIC framework not only sets new performance benchmarks across various datasets but also maintains a competitive coding speed, highlighting its significant potential and practical utility in the realm of volumetric medical image compression.
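The bit-division step can be made concrete with a short sketch. It assumes 16-bit voxels split evenly into an 8-bit most significant part and an 8-bit least significant part; the abstract does not state the split point, so `low_bits=8` and the helper names `split_bit_volumes` and `merge_bit_volumes` are hypothetical. The sketch only shows that the division is exactly invertible, which is what makes a lossless pipeline possible; the codecs applied to each part are not modelled.

```python
def split_bit_volumes(volume, low_bits=8):
    """Split every voxel into most- and least-significant parts.

    volume:   nested list indexed as [slice][row][column] of non-negative ints
    low_bits: number of bits assigned to the LSBV (assumed value)
    """
    mask = (1 << low_bits) - 1
    msbv = [[[v >> low_bits for v in row] for row in sl] for sl in volume]
    lsbv = [[[v & mask for v in row] for row in sl] for sl in volume]
    return msbv, lsbv


def merge_bit_volumes(msbv, lsbv, low_bits=8):
    """Recombine the two bit-volumes into the original voxel values."""
    return [
        [
            [(m << low_bits) | l for m, l in zip(m_row, l_row)]
            for m_row, l_row in zip(m_slice, l_slice)
        ]
        for m_slice, l_slice in zip(msbv, lsbv)
    ]


# One slice of a toy 16-bit volume.
volume = [[[0, 255, 256], [4095, 65535, 1234]]]
msbv, lsbv = split_bit_volumes(volume)
restored = merge_bit_volumes(msbv, lsbv)
```

In this toy slice the MSBV holds small, slowly varying values, while the LSBV carries the fine detail, mirroring the structural/texture distinction drawn in the abstract.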

Volume : 34 Pages : 113-125

Citations: 0

Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-11 DOI: 10.1109/TIP.2024.3512352

Zizhuo Li;Jiayi Ma

Accurately matching local features between a pair of images corresponding to the same 3D scene is a challenging computer vision task. Previous studies typically utilize attention-based graph neural networks (GNNs) with fully connected graphs over keypoints within and across images for visual and geometric information reasoning. However, in the context of local feature matching, a significant number of keypoints are non-repeatable due to factors such as occlusion and detector failure, and are thus irrelevant for message passing. Connectivity with non-repeatable keypoints not only introduces redundancy, resulting in limited efficiency (quadratic computational complexity w.r.t. the number of keypoints), but also interferes with the representation aggregation process, leading to limited accuracy. To obtain both accuracy and efficiency, we propose MaKeGNN, a sparse attention-based GNN architecture that bypasses non-repeatable keypoints and leverages matchable ones to guide compact and meaningful message passing. More specifically, our Bilateral Context-Aware Sampling (BCAS) Module first dynamically samples two small sets of well-distributed keypoints with high matchability scores from the image pair. Then, our Matchable Keypoint-Assisted Context Aggregation (MKACA) Module regards the sampled informative keypoints as message bottlenecks and thus constrains each keypoint to retrieve favorable contextual information only from matchable keypoints within and across images, avoiding the interference of irrelevant and redundant connectivity with non-repeatable ones. Furthermore, considering the potential noise in the initial keypoints and the sampled matchable ones, the MKACA module adopts a matchability-guided attentional aggregation operation for purer data-dependent context propagation. By these means, MaKeGNN outperforms the state of the art on multiple highly challenging benchmarks, while significantly reducing computational and memory complexity compared to typical attentional GNNs.
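The idea of routing messages through a small set of matchable keypoints can be sketched as follows. This is a simplified illustration under several assumptions, not the MaKeGNN architecture itself: sampling is plain top-k selection on given matchability scores, attention is scaled dot-product within a single image, and the matchability guidance is modelled by adding the log of the score to the attention logits. The names `top_k_indices` and `sparse_attention` are hypothetical.

```python
import math


def top_k_indices(scores, k):
    """Indices of the k highest matchability scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def sparse_attention(features, matchability, k):
    """Each keypoint aggregates context only from k sampled keypoints.

    features:     list of N descriptor vectors (lists of floats)
    matchability: list of N scores in (0, 1]
    The cost is N * k dot products instead of N * N for a full graph.
    """
    sampled = top_k_indices(matchability, k)
    dim = len(features[0])
    outputs = []
    for query in features:
        logits = []
        for idx in sampled:
            dot = sum(q * f for q, f in zip(query, features[idx]))
            # Scaled dot product biased by log-matchability (assumption).
            logits.append(dot / math.sqrt(dim) + math.log(matchability[idx]))
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * features[idx][d] for w, idx in zip(weights, sampled))
             for d in range(dim)]
        )
    return outputs, sampled


features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.3]]
matchability = [0.9, 0.1, 0.8, 0.05]
context, sampled = sparse_attention(features, matchability, k=2)
```

With `k` fixed and much smaller than the number of keypoints, the aggregation cost grows linearly rather than quadratically in the keypoint count, which is the efficiency argument made in the abstract.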


Volume : 34 Pages : 154-169

Citations: 0

Subjective and Objective Analysis of Indian Social Media Video Quality

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Pub Date : 2024-12-11 DOI: 10.1109/TIP.2024.3512376

Sandeep Mishra;Mukul Jha;Alan C. Bovik

We conducted a large-scale subjective study of the perceptual quality of User-Generated Mobile Video Content on a set of mobile-originated videos obtained from ShareChat, a social media platform widely used across India. The content, which volunteer human subjects viewed under controlled laboratory conditions, culturally diversifies the existing corpus of User-Generated Content (UGC) video quality datasets. There is a great need for large and diverse UGC-VQA datasets, given the explosive global growth of the visual internet and social media platforms. This is particularly true of videos captured by smartphones, especially in rapidly emerging economies like India. ShareChat provides a safe, community-oriented cultural space for users to generate and share content in their preferred Indian languages and dialects. Our subjective quality study, which is based on this data, supplies much-needed cultural, visual, and language diversification to the overall shareable corpus of video quality data. We expect that this new data resource will also allow for the development of systems that can predict the perceived visual quality of Indian social media videos and, in this context, control scaling and compression protocols for streaming, provide better user recommendations, and guide content analysis and processing. We demonstrate the value of the new data resource by conducting a study of leading No-Reference Video Quality Assessment (NR-VQA) models on it, including a simple new model, called MoEVA, which deploys a mixture of experts to predict video quality. Both the new LIVE-ShareChat Database and sample source code for MoEVA are being made freely available to the research community at https://github.com/sandeep-sm/LIVE-SC.
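The abstract says only that MoEVA deploys a mixture of experts, so the following is a generic mixture-of-experts regressor rather than a description of that model. It is a minimal sketch assuming linear experts and a linear softmax gate over a per-video feature vector; the function names, the feature values, and all parameters are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Affine map of a feature vector to a scalar."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def moe_predict(x, experts, gate):
    """Predict a quality score as a gated combination of expert outputs.

    x:       feature vector for one video (hypothetical features)
    experts: list of (weights, bias) pairs, one linear expert each
    gate:    list of (weights, bias) pairs, one gating logit per expert
    """
    gate_weights = softmax([linear(w, b, x) for w, b in gate])
    expert_scores = [linear(w, b, x) for w, b in experts]
    score = sum(g * s for g, s in zip(gate_weights, expert_scores))
    return score, gate_weights


x = [1.0, 2.0]
experts = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]  # expert outputs: 1.0 and 2.0
gate = [([0.0, 0.0], 0.0), ([0.0, 0.0], 0.0)]     # uniform gating
score, gate_weights = moe_predict(x, experts, gate)
```

With uniform gating the prediction is the mean of the expert outputs; a trained gate would instead shift weight toward the experts best suited to each input.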

Volume : 34 Pages : 140-153

Citations: 0