
Latest Publications: IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society)

Meta-TIP: An Unsupervised End-to-End Fusion Network for Multi-Dataset Style-Adaptive Threat Image Projection
IF 13.7 Pub Date: 2025-12-19 DOI: 10.1109/TIP.2025.3609135
Bowen Ma;Tong Jia;Hao Wang;Dongyue Chen
Threat Image Projection (TIP) is a convenient and effective means of expanding X-ray baggage image data, which is essential for training both security personnel and computer-aided screening systems. Existing methods fall primarily into two categories: X-ray imaging principle-based methods and GAN-based generative methods. The former cast prohibited-item acquisition and projection as two separate steps and rarely consider the style consistency between source prohibited items and target X-ray images from different datasets, making them less flexible and reliable in practical applications. Although GAN-based methods can directly generate visually consistent prohibited items on target images, they suffer from unstable training and a lack of interpretability, which significantly degrade the quality of the generated items. To overcome these limitations, we present a conceptually simple, flexible, and unsupervised end-to-end TIP framework, termed Meta-TIP, which superimposes the prohibited item distilled from the source image onto the target image in a style-adaptive manner. Specifically, Meta-TIP introduces three innovations: 1) a novel foreground-background contrastive loss that reconstructs a pure prohibited item from a cluttered source image; 2) a material-aware style-adaptive projection module that learns two modulation parameters from the style of similar-material objects in the target image to control the appearance of prohibited items; and 3) a novel logarithmic-form loss, designed according to the TIP imaging principle, that optimizes synthetic results in an unsupervised manner. We comprehensively verify the authenticity and training effect of the synthetic X-ray images on four public datasets, i.e., SIXray, OPIXray, PIXray, and PIDray, and the results confirm that our framework can flexibly generate highly realistic synthetic images without being tied to any particular dataset.
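As a rough illustration of the imaging principle behind the logarithmic-form loss mentioned above (X-ray transmittances of overlapping materials multiply, so compositing is additive in the log domain), here is a minimal PyTorch sketch; the function names, normalization, and the toy consistency loss are illustrative assumptions, not the authors' implementation.

```python
import torch

def tip_composite(target, item, mask, eps=1e-6):
    """Superimpose a prohibited item onto a target X-ray image.

    Both inputs are treated as transmittance maps in (0, 1]: under the
    Beer-Lambert model, overlapping materials multiply, which is an
    addition in the log domain.
    """
    log_t = torch.log(target.clamp(min=eps))
    log_s = torch.log(item.clamp(min=eps))
    # attenuate only where the item mask is active
    return torch.exp(log_t + mask * log_s)

def log_consistency_loss(pred, target, item, mask, eps=1e-6):
    """One plausible 'logarithmic-form' objective: the predicted composite
    should match the physically composited image in the log domain."""
    ref = tip_composite(target, item, mask, eps)
    return torch.mean(torch.abs(torch.log(pred.clamp(min=eps)) - torch.log(ref.clamp(min=eps))))

# toy usage
t = torch.rand(1, 3, 64, 64) * 0.9 + 0.1          # target X-ray image
s = torch.rand(1, 3, 64, 64) * 0.9 + 0.1          # distilled prohibited item
m = (torch.rand(1, 1, 64, 64) > 0.7).float()      # item mask
print(log_consistency_loss(tip_composite(t, s, m), t, s, m))  # ~0
```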
IEEE Transactions on Image Processing, vol. 34, pp. 8317–8331.
Citations: 0
Face Forgery Detection With CLIP-Enhanced Multi-Encoder Distillation
IF 13.7 Pub Date: 2025-12-19 DOI: 10.1109/TIP.2025.3644125
Chunlei Peng;Tianzhe Yan;Decheng Liu;Nannan Wang;Ruimin Hu;Xinbo Gao
With the development of face forgery technology, fake faces have become rampant, threatening security and authenticity in many fields, so face forgery detection is of great significance. Existing detection methods fall short in the comprehensiveness of feature extraction and in model adaptability, making it difficult to handle complex and changeable forgery scenarios accurately. The rise of multimodal models, however, provides new insights for forgery detection. Most current methods use relatively simple text prompts to describe the difference between real and fake faces, but they overlook the fact that the CLIP model itself has no forgery-detection-specific knowledge. Therefore, this paper proposes a face forgery detection method based on multi-encoder fusion and cross-modal knowledge distillation. On the one hand, the prior knowledge of the CLIP model and of a forgery-specific model is fused. On the other hand, through alignment distillation, the student model can learn the visual abnormal patterns and semantic features of forged samples captured by the teacher model. Specifically, we extract features of face photos by fusing the CLIP text encoder and the CLIP image encoder, and use datasets from the forgery detection field to pretrain and fine-tune the Deepfake-V2-Model to enhance its detection ability; together, these components serve as the teacher model. At the same time, the visual and language patterns of the teacher model are aligned with the visual patterns of the pretrained student model, and the aligned representations are distilled into the student model. This not only combines the rich representations of the CLIP image encoder with the strong generalization ability of text embeddings, but also enables the original model to effectively acquire knowledge relevant to forgery detection. Experiments show that our method effectively improves face forgery detection performance.
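A minimal sketch of the alignment-distillation idea described above: student visual features are projected and pulled toward fused CLIP image and text features acting as the frozen teacher target. The feature dimensions, the concatenation-based fusion, and the cosine objective are assumptions made for illustration, not the paper's exact losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentDistillation(nn.Module):
    """Cross-modal alignment distillation sketch: project student features into
    the space of concatenated CLIP image + text features and minimize a cosine
    distance against the detached (frozen) teacher target."""

    def __init__(self, student_dim=512, clip_dim=768):
        super().__init__()
        self.proj = nn.Linear(student_dim, 2 * clip_dim)

    def forward(self, student_feat, clip_img_feat, clip_txt_feat):
        teacher = torch.cat([clip_img_feat, clip_txt_feat], dim=-1).detach()  # frozen teacher target
        student = self.proj(student_feat)
        return 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()

# toy usage with random stand-ins for the encoder outputs
loss = AlignmentDistillation()(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 768))
print(loss.item())
```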
IEEE Transactions on Image Processing, vol. 34, pp. 8474–8484.
Citations: 0
TG-TSGNet: A Text-Guided Arbitrary-Resolution Terrain Scene Generation Network
IF 13.7 Pub Date: 2025-12-19 DOI: 10.1109/TIP.2025.3644231
Yifan Zhu;Yan Wang;Xinghui Dong
With the increasing demand for terrain visualization in fields such as augmented reality, virtual reality, and geographic mapping, traditional terrain scene modeling methods face great challenges in processing efficiency, content realism, and semantic consistency. To address these challenges, we propose a Text-Guided Arbitrary-Resolution Terrain Scene Generation Network (TG-TSGNet), which comprises a ConvMamba-VQGAN, a Text Guidance Sub-network, and an Arbitrary-Resolution Image Super-Resolution Module (ARSRM). The ConvMamba-VQGAN is built on the Conv-Based Local Representation Block (CLRB) and the Mamba-Based Global Representation Block (MGRB) that we design to exploit local and global features. The Text Guidance Sub-network comprises a text encoder and a Text-Image Alignment Module (TIAM) that incorporates textual semantics into the image representation. In addition, the ARSRM can be trained together with the ConvMamba-VQGAN to perform image super-resolution. To support the text-guided terrain scene generation task, we derive textual descriptions for the 36,672 images across the 38 categories of the Natural Terrain Scene Data Set (NTSD); these descriptions can be used to train and test the TG-TSGNet (the dataset, model, and source code are available at https://github.com/INDTLab/TG-TSGNet). Experimental results show that the TG-TSGNet outperforms, or at least performs comparably to, the baseline methods in image realism and semantic consistency while remaining efficient. We attribute this promising performance to the ability of the TG-TSGNet to capture both the local and global characteristics and the semantics of terrain scenes while reducing the computational cost of image generation.
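The Text-Image Alignment Module is not specified in detail in the abstract; one plausible reading is cross-attention from image latents to text-encoder tokens, sketched below with illustrative layer sizes (all of which are assumptions, not the authors' design).

```python
import torch
import torch.nn as nn

class TextImageAlignment(nn.Module):
    """Sketch of a text-image alignment step: image tokens attend to
    text-encoder tokens via cross-attention so textual semantics condition
    the image representation."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N, C) latent tokens from the image encoder
        # txt_tokens: (B, L, C) tokens from the text encoder
        attended, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(img_tokens + attended)  # residual fusion

img = torch.randn(2, 64, 256)
txt = torch.randn(2, 16, 256)
print(TextImageAlignment()(img, txt).shape)  # torch.Size([2, 64, 256])
```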
IEEE Transactions on Image Processing, vol. 34, pp. 8614–8626.
Citations: 0
Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution
IF 13.7 Pub Date: 2025-12-18 DOI: 10.1109/TIP.2025.3643146
Junbo Qiao;Jincheng Liao;Wei Li;Yulun Zhang;Yong Guo;Jiao Xie;Jie Hu;Shaohui Lin
Although Transformers have achieved significant success in low-level vision tasks, they are constrained by the quadratic complexity of self-attention and by limited-size windows. This limitation results in a lack of global receptive field across the entire image. Recently, State Space Models (SSMs) have gained widespread attention due to their global receptive field and linear complexity with respect to input length. However, integrating SSMs into low-level vision tasks presents two major challenges: 1) encoding high-resolution images pixel by pixel degrades the relationships among long-range tokens and causes a long-range forgetting problem; 2) the existing multi-direction scanning strategy introduces significant redundancy. To this end, we propose Hi-Mamba for image super-resolution (SR) to address these challenges; it unfolds the image with only a single scan. Specifically, the Global Hierarchical Mamba Block (GHMB) enables token interactions across the entire image, providing a global receptive field while leveraging a multi-scale structure to facilitate long-range dependency learning. Additionally, the Direction Alternation Module (DAM) adjusts the scanning patterns of GHMB across different layers to enhance spatial relationship modeling. Extensive experiments demonstrate that our Hi-Mamba achieves 0.2–0.27 dB PSNR gains on the Urban100 dataset across different scaling factors compared to the state-of-the-art MambaIRv2 for SR. Moreover, our lightweight Hi-Mamba also outperforms the lightweight SRFormer by 0.39 dB PSNR for ×2 SR.
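A minimal sketch of the direction-alternation idea: each layer processes the flattened token sequence in a single scan, with the scan order reversed on alternate layers. The GRU is only a runnable stand-in for a Mamba-style state-space block, and the overall layout is an assumption for illustration.

```python
import torch
import torch.nn as nn

class DirectionAlternatingScan(nn.Module):
    """Alternating single-direction scans across layers: the (H, W) token grid
    is unfolded into one 1D sequence, and every other layer reverses the scan
    order before mixing the sequence."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        # nn.GRU is a simple runnable stand-in for an SSM/Mamba sequence mixer
        self.layers = nn.ModuleList([nn.GRU(dim, dim, batch_first=True) for _ in range(depth)])

    def forward(self, x):                      # x: (B, H*W, C) flattened tokens
        for i, layer in enumerate(self.layers):
            if i % 2 == 1:                     # alternate the scan direction
                x = torch.flip(x, dims=[1])
            out, _ = layer(x)
            x = x + out                        # residual connection
            if i % 2 == 1:                     # restore the original token order
                x = torch.flip(x, dims=[1])
        return x

tokens = torch.randn(1, 32 * 32, 64)
print(DirectionAlternatingScan()(tokens).shape)  # torch.Size([1, 1024, 64])
```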
IEEE Transactions on Image Processing, vol. 34, pp. 8461–8473.
Citations: 0
ViV-ReID: Bidirectional Structural-Aware Spatial–Temporal Graph Networks on Large-Scale Video-Based Vessel Re-Identification Dataset
IF 13.7 Pub Date: 2025-12-18 DOI: 10.1109/TIP.2025.3643156
Mingxin Zhang;Fuxiang Feng;Xing Fang;Lin Zhang;Youmei Zhang;Xiaolei Li;Wei Zhang
Vessel re-identification (ReID) serves as a foundational task for intelligent maritime transportation systems. To enhance maritime surveillance capabilities, this study investigates video-based vessel ReID, a critical yet underexplored task in intelligent transportation systems whose progress has been limited by the lack of relevant datasets. We establish ViV-ReID, the first publicly available large-scale video-based vessel ReID dataset, comprising 480 vessel identities captured from 20 cross-port camera views (7,165 tracklets and 1.14 million frames), providing a benchmark for advancing vessel ReID from image to video processing. Videos offer significantly richer information than single-frame images, but their dynamic nature often leads to fragmented spatio-temporal features and disrupted contextual understanding. To address this problem, we further propose a Bidirectional Structural-Aware Spatial-Temporal Graph Network (Bi-SSTN) that explicitly aligns spatio-temporal features using vessel structural priors. Extensive experiments on the ViV-ReID dataset demonstrate that image-based ReID methods often show suboptimal performance when applied to video data; they also validate the effectiveness of spatio-temporal information and establish performance benchmarks for different methods. Bi-SSTN significantly outperforms state-of-the-art methods on ViV-ReID, confirming its efficacy in modeling vessel-specific spatio-temporal patterns. Project web page: https://vsislab.github.io/ViV_ReID/
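The abstract does not detail the graph construction; as one plausible, simplified reading, the sketch below links consecutive frames of a tracklet in both temporal directions and propagates features over the normalized adjacency. Everything here (feature dimension, adjacency, the single linear projection) is an illustrative assumption, not the paper's full structural-aware design.

```python
import torch
import torch.nn as nn

class TemporalGraphAggregation(nn.Module):
    """Bidirectional temporal message passing over the frame features of one
    tracklet: consecutive frames are connected in both directions and features
    are propagated with a row-normalized adjacency matrix."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frames):                       # frames: (T, C)
        t = frames.size(0)
        adj = torch.eye(t, device=frames.device)     # self-loops
        idx = torch.arange(t - 1, device=frames.device)
        adj[idx, idx + 1] = 1.0                      # forward temporal edges
        adj[idx + 1, idx] = 1.0                      # backward temporal edges
        adj = adj / adj.sum(dim=1, keepdim=True)     # row-normalize
        return torch.relu(adj @ self.proj(frames)) + frames

clip_feats = torch.randn(8, 256)                     # 8 frames of one vessel tracklet
print(TemporalGraphAggregation()(clip_feats).shape)  # torch.Size([8, 256])
```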
IEEE Transactions on Image Processing, vol. 34, pp. 8485–8499.
Citations: 0
Rethinking Artifact Mitigation in HDR Reconstruction: From Detection to Optimization
IF 13.7 Pub Date: 2025-12-17 DOI: 10.1109/TIP.2025.3642557
Xinyue Li;Zhangkai Ni;Hang Wu;Wenhan Yang;Hanli Wang;Lianghua He;Sam Kwong
Artifacts remain a long-standing challenge in High Dynamic Range (HDR) reconstruction. Existing methods focus on model designs for artifact mitigation but ignore explicit detection and suppression strategies. Because artifacts lack clear boundaries, distinct shapes, and semantic consistency, and because no dedicated dataset for HDR artifacts exists, progress in direct artifact detection and recovery has been impeded. To bridge this gap, we propose a unified HDR reconstruction framework that integrates artifact detection and model optimization. First, we build the first HDR artifact dataset (HADataset), comprising 1,213 diverse multi-exposure Low Dynamic Range (LDR) image sets and 1,765 HDR image pairs with per-pixel artifact annotations. Second, we develop an effective HDR artifact detector (HADetector), a robust detection model capable of accurately localizing HDR reconstruction artifacts. HADetector plays two pivotal roles: (1) enhancing existing HDR reconstruction models through fine-tuning, and (2) serving as a non-reference image quality assessment (NR-IQA) metric, the Artifact Score (AS), which aligns closely with human visual perception for reliable quality evaluation. Extensive experiments validate the effectiveness and generalizability of our framework, including the HADataset, HADetector, fine-tuning paradigm, and AS metric. The code and datasets are available at: https://github.com/xinyueliii/hdr-artifact-detect-optimize
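A minimal sketch of how a per-pixel artifact detector could double as a no-reference quality metric and a fine-tuning signal, in the spirit described above; the tiny head and the mean-probability Artifact Score are assumptions for illustration, not HADetector's actual architecture or scoring rule.

```python
import torch
import torch.nn as nn

class TinyArtifactHead(nn.Module):
    """Stand-in for an artifact detector: maps an HDR image to a per-pixel
    artifact probability map. The real HADetector is a full network; only the
    interface is sketched here."""

    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, hdr):
        return self.net(hdr)                            # (B, 1, H, W) artifact probabilities

def artifact_score(prob_map):
    """One plausible no-reference Artifact Score: the mean predicted artifact
    probability per image (lower is better)."""
    return prob_map.mean(dim=(1, 2, 3))

detector = TinyArtifactHead()
pred_hdr = torch.rand(2, 3, 64, 64)                     # reconstructed HDR images
score = artifact_score(detector(pred_hdr))              # usable as an NR-IQA value
finetune_penalty = score.mean()                         # or as an extra loss term when fine-tuning
print(score.shape, finetune_penalty.item())
```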
IEEE Transactions on Image Processing, vol. 34, pp. 8435–8446.
Citations: 0
Incomplete Modalities Restoration via Hierarchical Adaptation for Robust Multimodal Segmentation
IF 13.7 Pub Date: 2025-12-17 DOI: 10.1109/TIP.2025.3642612
Yujia Sun;Weisheng Dong;Peng Wu;Mingtao Feng;Tao Huang;Xin Li;Guangming Shi
Multimodal semantic segmentation has significantly advanced the field of semantic segmentation by integrating data from multiple sources. However, this task often encounters missing-modality scenarios due to challenges such as sensor failures or data transmission errors, which can result in substantial performance degradation. Existing approaches to handling missing modalities predominantly train separate models tailored to specific missing scenarios, typically requiring considerable computational resources. In this paper, we propose a Hierarchical Adaptation framework to Restore Missing Modalities for Multimodal segmentation (HARM3), which enables frozen pretrained multimodal models to be directly applied to missing-modality semantic segmentation tasks with minimal parameter updates. Central to HARM3 is a text-instructed missing-modality prompt module, which learns multimodal semantic knowledge by utilizing the available modalities and textual instructions to generate prompts for the missing modalities. By incorporating a small set of trainable parameters, this module effectively facilitates knowledge transfer between high-resource domains and low-resource domains where missing modalities are more prevalent. In addition, to further enhance the model's robustness and adaptability, we introduce adaptive perturbation training and an affine modality adapter. Extensive experimental results demonstrate the effectiveness and robustness of HARM3 across a variety of missing-modality scenarios.
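A minimal sketch of a text-instructed prompt generator for a missing modality, in the spirit of the module described above: available-modality tokens and a text-instruction embedding are mapped to a few prompt tokens by a small trainable layer while the backbone stays frozen. Sizes and the pooling/fusion choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MissingModalityPrompt(nn.Module):
    """Generate prompt tokens that stand in for a missing modality from the
    available modality's features plus a text-instruction embedding, so a
    frozen multimodal backbone can still be fed two input streams."""

    def __init__(self, feat_dim=256, txt_dim=512, num_prompts=4):
        super().__init__()
        self.num_prompts = num_prompts
        self.gen = nn.Linear(feat_dim + txt_dim, num_prompts * feat_dim)  # the only trainable part

    def forward(self, avail_tokens, txt_emb):
        # avail_tokens: (B, N, C) from the available modality; txt_emb: (B, D)
        pooled = avail_tokens.mean(dim=1)
        prompts = self.gen(torch.cat([pooled, txt_emb], dim=-1))
        return prompts.view(-1, self.num_prompts, avail_tokens.size(-1))

prompts = MissingModalityPrompt()(torch.randn(2, 196, 256), torch.randn(2, 512))
print(prompts.shape)  # torch.Size([2, 4, 256]) -- pseudo-tokens for the missing modality
```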
IEEE Transactions on Image Processing, vol. 34, pp. 8672–8683.
Citations: 0
Micro-Expression Analysis Based on Self-Adaptive Pseudo-Labeling and Residual Connected Channel Attention Mechanisms
IF 13.7 Pub Date: 2025-12-17 DOI: 10.1109/TIP.2025.3642527
Jinxiu Zhang;Weidong Min;Jiahao Li;Qing Han
Micro-expressions can reveal genuine emotions that are not easily concealed, making them invaluable in fields such as psychotherapy and criminal interrogation. However, existing pseudo-labeling-based methods for micro-expression analysis have two major limitations. First, pseudo-labels generated by a sliding window do not account for the actual proportion of micro-expression frames in the video, which leads to inaccurate labeling. Second, they predominantly focus on overall features, thereby neglecting subtle features. In this paper, we propose a micro-expression analysis method called the Spot-Then-Recognize Method (STRM), which integrates spotting and recognition tasks. To address the first limitation, we propose a Self-Adaptive Pseudo-labeling Method (SAPM) that dynamically assigns pseudo-labels to micro-expression frames according to their actual proportion in the video sequence, thereby improving labeling accuracy. To address the second limitation, we design a Multi-Scale Residual Channel Attention Network (MSRCAN) to effectively extract subtle micro-expression features. The MSRCAN comprises three modules: a Multi-Scale Shared Network (MSSN), a Spotting Network, and a Recognition Network. The MSSN first extracts micro-expression features by performing multi-scale feature extraction with Residual Connected Channel Attention Modules (RCCAM), which are then refined in the spotting and recognition networks. We conducted comprehensive experiments on three short-video datasets (CASME II, SMIC-E-HS, SMIC-E-NIR) and two long-video datasets (CAS(ME)2, SAMMLV). Experimental results show that our proposed method significantly outperforms existing methods, achieving an overall micro-expression analysis performance of 58.24%, a 19.62% improvement and a 1.51× gain over the baseline.
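A minimal sketch of proportion-aware pseudo-labeling as described for SAPM: the number of frames labeled as micro-expression is tied to their estimated proportion in the sequence rather than to a fixed sliding-window rule. The per-frame scores and the ratio estimate are assumed inputs; the top-k selection rule is an illustrative simplification.

```python
import torch

def adaptive_pseudo_labels(frame_scores, me_ratio):
    """Label the top-k scoring frames as micro-expression frames, with k set
    by the estimated proportion of micro-expression frames in the sequence."""
    t = frame_scores.numel()
    k = max(1, int(round(me_ratio * t)))          # number of frames to label positive
    labels = torch.zeros(t)
    labels[torch.topk(frame_scores, k).indices] = 1.0
    return labels

scores = torch.rand(100)   # per-frame micro-expression scores, e.g. from a spotting branch
print(adaptive_pseudo_labels(scores, me_ratio=0.06).sum())  # tensor(6.)
```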
IEEE Transactions on Image Processing, vol. 35, pp. 221–233.
Citations: 0
FA-Net: A Feature Alignment Network for Video-Based Visible-Infrared Person Re-Identification
IF 13.7 Pub Date: 2025-12-17 DOI: 10.1109/TIP.2025.3642633
Xi Yang;Wenjiao Dong;Xian Wang;De Cheng;Nannan Wang
Video-based visible-infrared person re-identification (VVI-ReID) aims to match target pedestrians between visible and infrared videos, and is widely applied in 24-hour surveillance systems. The key to VVI-ReID is learning modality-invariant and spatio-temporally invariant sequence-level representations to address challenges such as modality differences, spatio-temporal misalignment, and domain-shift noise. However, existing methods predominantly emphasize reducing modality discrepancy while relatively neglecting temporal misalignment and domain-shift noise reduction. To this end, this paper proposes a VVI-ReID framework called the Feature Alignment Network (FA-Net) from the perspective of feature alignment, aiming to mitigate temporal misalignment. FA-Net comprises two main alignment modules: the Spatial-Temporal Alignment Module (STAM) and the Modality Distribution Constraint (MDC). STAM integrates global and local features to ensure that individuals' spatial representations are aligned, and it also establishes temporal relationships by exploring inter-frame features to address cross-frame person feature matching. The MDC utilizes a symmetric distribution loss to align the feature distributions of the two modalities. In addition, the SAM Guidance Augmentation (SAM-GA) strategy is designed to transform the image space of RGB and IR frames to provide more informative and less noisy frame information. Extensive experimental results demonstrate the effectiveness of the proposed method, which surpasses existing state-of-the-art methods. Our code will be available at: https://github.com/code/FANet
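The symmetric distribution loss in the MDC is not spelled out in the abstract; one plausible instantiation is a symmetric KL divergence between softmax-normalized visible and infrared sequence features, sketched below (the temperature and the softmax normalization are assumptions, not the paper's exact constraint).

```python
import torch
import torch.nn.functional as F

def symmetric_distribution_loss(vis_feat, ir_feat, temperature=1.0):
    """Symmetric KL between softmax-normalized sequence-level features of the
    visible and infrared modalities: 0.5 * (KL(p || q) + KL(q || p))."""
    p = F.log_softmax(vis_feat / temperature, dim=-1)
    q = F.log_softmax(ir_feat / temperature, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

vis = torch.randn(8, 2048)   # sequence-level features of visible tracklets
ir = torch.randn(8, 2048)    # sequence-level features of infrared tracklets
print(symmetric_distribution_loss(vis, ir).item())
```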
IEEE Transactions on Image Processing, vol. 34, pp. 8406–8420.
Citations: 0
NiCI-Pruning: Enhancing Diffusion Model Pruning via Noise in Clean Image Guidance
IF 13.7 Pub Date: 2025-12-17 DOI: 10.1109/TIP.2025.3643138
Junzhu Mao;Zeren Sun;Yazhou Yao;Tianfei Zhou;Liqiang Nie;Xiansheng Hua
The substantial successes achieved by diffusion probabilistic models have prompted the study of their deployment in resource-limited scenarios. Pruning methods have proven effective in compressing discriminative models by relying on the correlation between training losses and model performance. However, diffusion models employ an iterative process to generate high-quality images, which breaks this connection. To address this challenge, we propose a simple yet effective method, named NiCI-Pruning (Noise in Clean Image Pruning), for compressing diffusion models. NiCI-Pruning capitalizes on the noise predicted by the model from clean image inputs, favoring it as a feature for establishing reconstruction losses. Accordingly, Taylor expansion is applied to the proposed reconstruction loss to evaluate parameter importance effectively. Moreover, we propose an interval sampling strategy that incorporates a timestep-weighted schema, alleviating the risk of misleading information obtained at later timesteps. We provide comprehensive experimental results to affirm the superiority of our proposed approach. Notably, our method reduces the FID score increase by 30.4% on average across five different datasets compared to the state-of-the-art diffusion pruning method at equivalent pruning rates. Our code and models are available at https://github.com/NUST-Machine-Intelligence-Laboratory/NiCI-Pruning
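A minimal sketch of a first-order Taylor importance criterion applied to a reconstruction loss on noise predicted from clean-image inputs, in the spirit of the description above; the toy denoiser and the randomly generated reference prediction are stand-ins (the real reference would come from, e.g., the unpruned model), not the NiCI-Pruning pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def taylor_importance(model, loss):
    """After backpropagating a reconstruction loss, score each convolutional
    weight by |w * dL/dw| and sum per output channel, a common first-order
    Taylor criterion for structured pruning."""
    loss.backward()
    scores = {}
    for name, p in model.named_parameters():
        if p.grad is not None and p.dim() == 4:          # conv weights only
            scores[name] = (p * p.grad).abs().sum(dim=(1, 2, 3))
    return scores

# toy stand-in for a denoiser epsilon(x_clean); the real model is a diffusion U-Net
denoiser = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 3, 3, padding=1))
clean = torch.rand(2, 3, 32, 32)                         # clean image inputs
reference = torch.randn_like(clean)                      # mock reference noise prediction
loss = F.mse_loss(denoiser(clean), reference)            # reconstruction loss on predicted noise
print({k: v.shape for k, v in taylor_importance(denoiser, loss).items()})
```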
IEEE Transactions on Image Processing, vol. 34, pp. 8447–8460.
Citations: 0