Pub Date: 2025-12-19. DOI: 10.1109/TIP.2025.3609135
Bowen Ma;Tong Jia;Hao Wang;Dongyue Chen
Threat Image Projection (TIP) is a convenient and effective means of expanding X-ray baggage image collections, which is essential for training both security personnel and computer-aided screening systems. Existing methods fall primarily into two categories: methods based on the X-ray imaging principle and GAN-based generative methods. The former cast prohibited-item acquisition and projection as two separate steps and rarely consider the style consistency between the source prohibited items and target X-ray images drawn from different datasets, making them less flexible and reliable in practice. Although GAN-based methods can directly generate visually consistent prohibited items on target images, they suffer from unstable training and a lack of interpretability, which significantly degrade the quality of the generated items. To overcome these limitations, we present a conceptually simple, flexible, and unsupervised end-to-end TIP framework, termed Meta-TIP, which superimposes the prohibited item distilled from a source image onto a target image in a style-adaptive manner. Specifically, Meta-TIP introduces three main innovations: 1) it reconstructs a pure prohibited item from a cluttered source image with a novel foreground-background contrastive loss; 2) a material-aware style-adaptive projection module learns two modulation parameters based on the style of similar-material objects in the target image to control the appearance of the prohibited item; and 3) a novel logarithmic-form loss, derived from the TIP imaging principle, optimizes the synthetic results in an unsupervised manner. We comprehensively verify the authenticity and training benefit of the synthetic X-ray images on four public datasets, i.e., SIXray, OPIXray, PIXray, and PIDray, and the results confirm that our framework can flexibly generate highly realistic synthetic images free of the above limitations.
{"title":"Meta-TIP: An Unsupervised End-to-End Fusion Network for Multi-Dataset Style-Adaptive Threat Image Projection","authors":"Bowen Ma;Tong Jia;Hao Wang;Dongyue Chen","doi":"10.1109/TIP.2025.3609135","DOIUrl":"https://doi.org/10.1109/TIP.2025.3609135","url":null,"abstract":"Threat Image Projection (TIP) is a convenient and effective means to expand X-ray baggage images, which is essential for training both security personnel and computer-aided screening systems. Existing methods are primarily divided into two categories: X-ray imaging principle-based methods and GAN-based generative methods. The former cast prohibited items acquisition and projection as two individual steps and rarely consider the style consistency between the source prohibited items and target X-ray images from different datasets, making them less flexible and reliable for practical applications. Although GAN-based methods can directly generate visually consistent prohibited items on target images, they suffer from unstable training and lack of interpretability, which significantly impact the quality of the generated items. To overcome these limitations, we present a conceptually simple, flexible and unsupervised end-to-end TIP framework, termed as Meta-TIP, which superimposes the prohibited item distilled from the source image onto the target image in a style-adaptive manner. Specifically, Meta-TIP mainly applies three innovations: 1) reconstruct a pure prohibited item from a cluttered source image with a novel foreground-background contrastive loss; 2) a material-aware style-adaptive projection module learns two modulation parameters pertinently based on the style of similar material objects in the target image to control the appearance of prohibited items; 3) a novel logarithmic form loss is well-designed based on the principle of TIP to optimize synthetic results in an unsupervised manner. We comprehensively verify the authenticity and training effect of the synthetic X-ray images on four public datasets, i.e., SIXray, OPIXray, PIXray, and PIDray dataset, and the results confirm that our framework can flexibly generate very realistic synthetic images without any limitations.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8317-8331"},"PeriodicalIF":13.7,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the development of face forgery technology, fake faces have become rampant, threatening security and authenticity in many fields, so face forgery detection is of great significance. Existing detection methods fall short in the comprehensiveness of feature extraction and in model adaptability, and they struggle to handle complex and changeable forgery scenarios accurately. The rise of multimodal models offers new insights for forgery detection, but most current methods use relatively simple text prompts to describe the difference between real and fake faces and overlook the fact that the CLIP model itself carries no knowledge specific to forgery detection. Therefore, this paper proposes a face forgery detection method based on multi-encoder fusion and cross-modal knowledge distillation. On the one hand, the prior knowledge of the CLIP model and a forgery-specific model is fused; on the other hand, alignment distillation lets the student model learn the visual anomaly patterns and semantic features of forged samples captured by the teacher model. Specifically, we extract features of face photos by fusing the CLIP text encoder and the CLIP image encoder, and we pretrain and fine-tune the Deepfake-V2-Model on forgery-detection datasets to strengthen its detection ability; together these form the teacher model. The visual and language patterns of the teacher model are then aligned with the visual patterns of the pretrained student model, and the aligned representations are distilled into the student. This not only combines the rich representations of the CLIP image encoder with the strong generalization ability of text embeddings, but also enables the original model to effectively acquire knowledge relevant to forgery detection. Experiments show that our method effectively improves face forgery detection performance.
{"title":"Face Forgery Detection With CLIP-Enhanced Multi-Encoder Distillation","authors":"Chunlei Peng;Tianzhe Yan;Decheng Liu;Nannan Wang;Ruimin Hu;Xinbo Gao","doi":"10.1109/TIP.2025.3644125","DOIUrl":"10.1109/TIP.2025.3644125","url":null,"abstract":"With the development of face forgery technology, fake faces are rampant, threatening the security and authenticity of many fields. Therefore, it is of great significance to study face forgery detection. At present, existing detection methods have deficiencies in the comprehensiveness of feature extraction and model adaptability, and it is difficult to accurately deal with complex and changeable forgery scenarios. However, the rise of multimodal models provides new insights for current forgery detection methods. At present, most methods use relatively simple text prompts to describe the difference between real and fake faces. However, these researchers ignore that the CLIP model itself does not have the relevant knowledge of forgery detection. Therefore, our paper proposes a face forgery detection method based on multi-encoder fusion and cross-modal knowledge distillation. On the one hand, the prior knowledge of the CLIP model and the forgery model is fused. On the other hand, through the alignment distillation, the student model can learn the visual abnormal patterns and semantic features of the forged samples captured by the teacher model. Specifically, our paper extracts the features of face photos by fusing the CLIP text encoder and the CLIP image encoder, and uses the dataset in the field of forgery detection to pretrain and fine-tune the Deepfake-V2-Model to enhance the detection ability, which are regarded as the teacher model. At the same time, the visual and language patterns of the teacher model are aligned with the visual patterns of the pretrained student model, and the aligned representations are refined to the student model. This not only combines the rich representation of the CLIP image encoder and the excellent generalization ability of text embedding, but also enables the original model to effectively acquire relevant knowledge for forgery detection. Experiments show that our method effectively improves the performance on face forgery detection.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8474-8484"},"PeriodicalIF":13.7,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-19. DOI: 10.1109/TIP.2025.3644231
Yifan Zhu;Yan Wang;Xinghui Dong
With the increasing demand for terrain visualization in fields such as augmented reality, virtual reality, and geographic mapping, traditional terrain scene modeling methods face great challenges in processing efficiency, content realism, and semantic consistency. To address these challenges, we propose a Text-Guided Arbitrary-Resolution Terrain Scene Generation Network (TG-TSGNet), which contains a ConvMamba-VQGAN, a Text Guidance Sub-network, and an Arbitrary-Resolution Image Super-Resolution Module (ARSRM). The ConvMamba-VQGAN is built on top of the Conv-Based Local Representation Block (CLRB) and the Mamba-Based Global Representation Block (MGRB) that we design, to exploit both local and global features. The Text Guidance Sub-network comprises a text encoder and a Text-Image Alignment Module (TIAM) that incorporates textual semantics into the image representation. In addition, the ARSRM can be trained together with the ConvMamba-VQGAN to perform image super-resolution. To support the text-guided terrain scene generation task, we derive a set of textual descriptions for the 36,672 images across the 38 categories of the Natural Terrain Scene Data Set (NTSD). These descriptions can be used to train and test the TG-TSGNet (the dataset, model, and source code are available at https://github.com/INDTLab/TG-TSGNet). Experimental results show that the TG-TSGNet outperforms, or at least performs comparably to, the baseline methods in image realism and semantic consistency, with reasonable efficiency. We attribute this promising performance to the ability of the TG-TSGNet not only to capture the local and global characteristics and the semantics of terrain scenes, but also to reduce the computational cost of image generation.
{"title":"TG-TSGNet: A Text-Guided Arbitrary-Resolution Terrain Scene Generation Network","authors":"Yifan Zhu;Yan Wang;Xinghui Dong","doi":"10.1109/TIP.2025.3644231","DOIUrl":"10.1109/TIP.2025.3644231","url":null,"abstract":"With the increasing demand for terrain visualization in many fields, such as augmented reality, virtual reality and geographic mapping, traditional terrain scene modeling methods encounter great challenges in processing efficiency, content realism and semantic consistency. To address these challenges, we propose a Text-Guided Arbitrary-Resolution Terrain Scene Generation Network (TG-TSGNet), which contains a ConvMamba-VQGAN, a Text Guidance Sub-network and an Arbitrary-Resolution Image Super-Resolution Module (ARSRM). The ConvMamba-VQGAN is built on top of the Conv-Based Local Representation Block (CLRB) and the Mamba-Based Global Representation Block (MGRB) that we design, to utilize local and global features. Furthermore, the Text Guidance Sub-network comprises a text encoder and a Text-Image Alignment Module (TIAM) for the sake of incorporating textual semantics into image representation. In addition, the ARSRM can be trained together with the ConvMamba-VQGAN, to perform the task of image super-resolution. To fulfill the text-guided terrain scene generation task, we derive a set of textual descriptions for the 36,672 images across the 38 categories of the Natural Terrain Scene Data Set (NTSD). These descriptions can be used to train and test the TG-TSGNet (The data set, model and source code are available at <uri>https://github.com/INDTLab/TG-TSGNet</uri>). Experimental results show that the TG-TSGNet outperforms, or at least performs comparably to, the baseline methods in image realism and semantic consistency with proper efficiency. We believe that the promising performance should be due to the ability of the TG-TSGNet not only to capture both the local and global characteristics and the semantics of terrain scenes, but also to reduce the computational cost of image generation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8614-8626"},"PeriodicalIF":13.7,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-18. DOI: 10.1109/TIP.2025.3643146
Junbo Qiao;Jincheng Liao;Wei Li;Yulun Zhang;Yong Guo;Jiao Xie;Jie Hu;Shaohui Lin
Although Transformers have achieved significant success in low-level vision tasks, they are constrained by the quadratic complexity of self-attention and by limited-size windows, which deprives them of a global receptive field over the entire image. Recently, State Space Models (SSMs) have gained widespread attention due to their global receptive field and linear complexity with respect to input length. However, integrating SSMs into low-level vision tasks presents two major challenges: 1) the relationships between long-range tokens degrade, since encoding high-resolution images pixel by pixel induces a long-range forgetting problem; and 2) the existing multi-direction scanning strategy is highly redundant. To this end, we propose Hi-Mamba for image super-resolution (SR), which addresses these challenges by unfolding the image with only a single scan. Specifically, the Global Hierarchical Mamba Block (GHMB) enables token interactions across the entire image, providing a global receptive field while leveraging a multi-scale structure to facilitate long-range dependency learning. Additionally, the Direction Alternation Module (DAM) alternates the scanning pattern of the GHMB across layers to enhance spatial relationship modeling. Extensive experiments demonstrate that Hi-Mamba achieves 0.2–0.27 dB PSNR gains on the Urban100 dataset across different scaling factors compared to the state-of-the-art MambaIRv2 for SR. Moreover, our lightweight Hi-Mamba also outperforms the lightweight SRFormer by 0.39 dB PSNR for $\times 2$ SR.
{"title":"Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution","authors":"Junbo Qiao;Jincheng Liao;Wei Li;Yulun Zhang;Yong Guo;Jiao Xie;Jie Hu;Shaohui Lin","doi":"10.1109/TIP.2025.3643146","DOIUrl":"10.1109/TIP.2025.3643146","url":null,"abstract":"Despite Transformers have achieved significant success in low-level vision tasks, they are constrained by computing self-attention with a quadratic complexity and limited-size windows. This limitation results in a lack of global receptive field across the entire image. Recently, State Space Models (SSMs) have gained widespread attention due to their global receptive field and linear complexity with respect to input length. However, integrating SSMs into low-level vision tasks presents two major challenges: 1) Relationship degradation of long-range tokens with a long-range forgetting problem by encoding pixel-by-pixel high-resolution images. 2) Significant redundancy in the existing multi-direction scanning strategy. To this end, we propose Hi-Mamba for image super-resolution (SR) to address these challenges, which unfolds the image with only a single scan. Specifically, the Global Hierarchical Mamba Block (GHMB) enables token interactions across the entire image, providing a global receptive field while leveraging a multi-scale structure to facilitate long-range dependency learning. Additionally, the Direction Alternation Module (DAM) adjusts the scanning patterns of GHMB across different layers to enhance spatial relationship modeling. Extensive experiments demonstrate that our Hi-Mamba achieves 0.2–0.27dB PSNR gains on the Urban100 dataset across different scaling factors compared to the state-of-the-art MambaIRv2 for SR. Moreover, our lightweight Hi-Mamba also outperforms lightweight SRFormer by 0.39dB PSNR for <inline-formula> <tex-math>$times 2$ </tex-math></inline-formula> SR.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8461-8473"},"PeriodicalIF":13.7,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145777441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vessel re-identification (ReID) is a foundational task for intelligent maritime transportation systems. To enhance maritime surveillance capabilities, this study investigates video-based vessel ReID, a critical yet underexplored task whose progress has been limited by the lack of relevant datasets. We establish ViV-ReID, the first publicly available large-scale video-based vessel ReID dataset, comprising 480 vessel identities captured from 20 cross-port camera views (7,165 tracklets and 1.14 million frames), providing a benchmark for advancing vessel ReID from image to video processing. Videos offer significantly richer information than single-frame images, but their dynamic nature often fragments spatio-temporal features and disrupts contextual understanding. To address this problem, we further propose a Bidirectional Structural-Aware Spatial-Temporal Graph Network (Bi-SSTN) that explicitly aligns spatio-temporal features using vessel structural priors. Extensive experiments on ViV-ReID show that image-based ReID methods often perform suboptimally when applied to video data, underscoring the value of spatio-temporal information and the need for performance benchmarks across different methods. Bi-SSTN significantly outperforms state-of-the-art methods on ViV-ReID, confirming its efficacy in modeling vessel-specific spatio-temporal patterns. Project web page: https://vsislab.github.io/ViV_ReID/
{"title":"ViV-ReID: Bidirectional Structural-Aware Spatial–Temporal Graph Networks on Large-Scale Video-Based Vessel Re-Identification Dataset","authors":"Mingxin Zhang;Fuxiang Feng;Xing Fang;Lin Zhang;Youmei Zhang;Xiaolei Li;Wei Zhang","doi":"10.1109/TIP.2025.3643156","DOIUrl":"10.1109/TIP.2025.3643156","url":null,"abstract":"Vessel re-identification (ReID) serves as a foundational task for intelligent maritime transportation systems. To enhance maritime surveillance capabilities, this study investigates video-based vessel ReID, a critical yet underexplored task in intelligent transportation systems. The lack of relevant datasets has limited the progress of Video-based vessel ReID research work. We established ViV-ReID, the first publicly available large-scale video-based vessel ReID dataset, comprising 480 vessel identities captured from 20 cross-port camera views (7,165 tracklets and 1.14 million frames), establishing a benchmark for advancing vessel ReID from image to video processing. Videos offer significantly richer information than single-frame images. The dynamic nature of video often leads to fragmented spatio-temporal features causing disrupted contextual understanding, and to address this problem, we further propose a Bidirectional Structural-Aware Spatial-Temporal Graph Network (Bi-SSTN) that explicitly aligns spatio-temporal features using vessel structural priors. Extensive experiments on the ViV-ReID dataset demonstrate that image-based ReID methods often show suboptimal performance when applied to video data. Meanwhile, it is crucial to validate the effectiveness of spatio-temporal information and establish performance benchmarks for different methods. The Bidirectional Structural-Aware Spatial-Temporal Graph Network (Bi-SSTN) significantly outperforms state-of-the-art methods on ViV-ReID, confirming its efficacy in modeling vessel-specific spatio-temporal patterns. Project web page: <uri>https://vsislab.github.io/ViV_ReID/</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8485-8499"},"PeriodicalIF":13.7,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145777297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Artifacts remain a long-standing challenge in High Dynamic Range (HDR) reconstruction. Existing methods focus on model designs for artifact mitigation but ignore explicit detection and suppression strategies. Because artifacts lack clear boundaries, distinct shapes, and semantic consistency, and no dedicated dataset for HDR artifacts exists, progress in direct artifact detection and recovery has been impeded. To bridge this gap, we propose a unified HDR reconstruction framework that integrates artifact detection and model optimization. First, we build the first HDR artifact dataset (HADataset), comprising 1,213 diverse multi-exposure Low Dynamic Range (LDR) image sets and 1,765 HDR image pairs with per-pixel artifact annotations. Second, we develop an effective HDR artifact detector (HADetector), a robust detection model capable of accurately localizing HDR reconstruction artifacts. HADetector plays two pivotal roles: (1) enhancing existing HDR reconstruction models through fine-tuning, and (2) serving as a no-reference image quality assessment (NR-IQA) metric, the Artifact Score (AS), which aligns closely with human visual perception for reliable quality evaluation. Extensive experiments validate the effectiveness and generalizability of our framework, including the HADataset, HADetector, fine-tuning paradigm, and AS metric. The code and datasets are available at: https://github.com/xinyueliii/hdr-artifact-detect-optimize
{"title":"Rethinking Artifact Mitigation in HDR Reconstruction: From Detection to Optimization","authors":"Xinyue Li;Zhangkai Ni;Hang Wu;Wenhan Yang;Hanli Wang;Lianghua He;Sam Kwong","doi":"10.1109/TIP.2025.3642557","DOIUrl":"10.1109/TIP.2025.3642557","url":null,"abstract":"Artifact remains a long-standing challenge in High Dynamic Range (HDR) reconstruction. Existing methods focus on model designs for artifact mitigation but ignore explicit detection and suppression strategies. Because artifact lacks clear boundaries, distinct shapes, and semantic consistency, and there is no existing dedicated dataset for HDR artifact, progress in direct artifact detection and recovery is impeded. To bridge the gap, we propose a unified HDR reconstruction framework that integrates artifact detection and model optimization. Firstly, we build the first HDR artifact dataset (HADataset), comprising 1,213 diverse multi-exposure Low Dynamic Range (LDR) image sets and 1,765 HDR image pairs with per-pixel artifact annotations. Secondly, we develop an effective HDR artifact detector (HADetector), a robust artifact detection model capable of accurately localizing HDR reconstruction artifact. HADetector plays two pivotal roles: (1) enhancing existing HDR reconstruction models through fine-tuning, and (2) serving as a non-reference image quality assessment (NR-IQA) metric, the Artifact Score (AS), which aligns closely with human visual perception for reliable quality evaluation. Extensive experiments validate the effectiveness and generalizability of our framework, including the HADataset, HADetector, fine-tuning paradigm, and AS metric. The code and datasets are available at: <uri>https://github.com/xinyueliii/hdr-artifact-detect-optimize</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8435-8446"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-17. DOI: 10.1109/TIP.2025.3642612
Yujia Sun;Weisheng Dong;Peng Wu;Mingtao Feng;Tao Huang;Xin Li;Guangming Shi
Multimodal semantic segmentation has significantly advanced the field of semantic segmentation by integrating data from multiple sources. However, this task often encounters missing-modality scenarios due to challenges such as sensor failures or data transmission errors, which can result in substantial performance degradation. Existing approaches to handling missing modalities predominantly train separate models tailored to specific missing scenarios, typically requiring considerable computational resources. In this paper, we propose a Hierarchical Adaptation framework to Restore Missing Modalities for Multimodal segmentation (HARM3), which enables frozen pretrained multimodal models to be applied directly to missing-modality semantic segmentation with minimal parameter updates. Central to HARM3 is a text-instructed missing-modality prompt module, which learns multimodal semantic knowledge by utilizing the available modalities and textual instructions to generate prompts for the missing modalities. By incorporating a small set of trainable parameters, this module effectively facilitates knowledge transfer between high-resource domains and low-resource domains where missing modalities are more prevalent. In addition, to further enhance the model’s robustness and adaptability, we introduce adaptive perturbation training and an affine modality adapter. Extensive experimental results demonstrate the effectiveness and robustness of HARM3 across a variety of missing-modality scenarios.
{"title":"Incomplete Modalities Restoration via Hierarchical Adaptation for Robust Multimodal Segmentation","authors":"Yujia Sun;Weisheng Dong;Peng Wu;Mingtao Feng;Tao Huang;Xin Li;Guangming Shi","doi":"10.1109/TIP.2025.3642612","DOIUrl":"10.1109/TIP.2025.3642612","url":null,"abstract":"Multimodal semantic segmentation has significantly advanced the field of semantic segmentation by integrating data from multiple sources. However, this task often encounters missing modality scenarios due to challenges such as sensor failures or data transmission errors, which can result in substantial performance degradation. Existing approaches to addressing missing modalities predominantly involve training separate models tailored to specific missing scenarios, typically requiring considerable computational resources. In this paper, we propose a Hierarchical Adaptation framework to Restore Missing Modalities for Multimodal segmentation (HARM3), which enables frozen pretrained multimodal models to be directly applied to missing-modality semantic segmentation tasks with minimal parameter updates. Central to HARM3 is a text-instructed missing modality prompt module, which learns multimodal semantic knowledge by utilizing available modalities and textual instructions to generate prompts for the missing modalities. By incorporating a small set of trainable parameters, this module effectively facilitates knowledge transfer between high-resource domains and low-resource domains where missing modalities are more prevalent. Besides, to further enhance the model’s robustness and adaptability, we introduce adaptive perturbation training and an affine modality adapter. Extensive experimental results demonstrate the effectiveness and robustness of HARM3 across a variety of missing modality scenarios.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8672-8683"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-17. DOI: 10.1109/TIP.2025.3642527
Jinxiu Zhang;Weidong Min;Jiahao Li;Qing Han
Micro-expressions can reveal genuine emotions that are not easily concealed, making them invaluable in fields such as psychotherapy and criminal interrogation. However, existing pseudo-labeling-based methods for micro-expression analysis have two major limitations. First, pseudo-labels generated by a sliding window do not account for the actual proportion of micro-expression frames in the video, which leads to inaccurate labeling. Second, they predominantly focus on overall features and thereby neglect subtle ones. In this paper, we propose a micro-expression analysis method called the Spot-Then-Recognize Method (STRM), which integrates the spotting and recognition tasks. To address the first limitation, we propose a Self-Adaptive Pseudo-labeling Method (SAPM) that dynamically assigns pseudo-labels to micro-expression frames according to their actual proportion in the video sequence, thereby improving labeling accuracy. To address the second limitation, we design a Multi-Scale Residual Channel Attention Network (MSRCAN) to effectively extract subtle micro-expression features. The MSRCAN comprises three modules: a Multi-Scale Shared Network (MSSN), a Spotting Network, and a Recognition Network. The MSSN first extracts micro-expression features through multi-scale feature extraction with Residual Connected Channel Attention Modules (RCCAM), and these features are then refined in the spotting and recognition networks. We conducted comprehensive experiments on three short-video datasets (CASME II, SMIC-E-HS, SMIC-E-NIR) and two long-video datasets (CAS(ME)2, SAMMLV). Experimental results show that our proposed method significantly outperforms existing methods, achieving an overall micro-expression analysis performance of 58.24%, a 19.62% improvement and a $1.51\times$ gain over the baseline.
{"title":"Micro-Expression Analysis Based on Self-Adaptive Pseudo-Labeling and Residual Connected Channel Attention Mechanisms","authors":"Jinxiu Zhang;Weidong Min;Jiahao Li;Qing Han","doi":"10.1109/TIP.2025.3642527","DOIUrl":"10.1109/TIP.2025.3642527","url":null,"abstract":"Micro-expressions can reveal genuine emotions that are not easily concealed, making them invaluable in fields such as psychotherapy and criminal interrogation. However, existing pseudo-labeling-based methods for micro-expression analysis have two major limitations. First, pseudo-labels generated by the sliding window do not account for the actual proportion of micro-expressions in the video, which leads to inaccurate labeling. Second, they predominantly focus on overall features, thereby neglecting subtle features. In this paper, we propose a micro-expression analysis method called Spot-Then-Recognize Method (STRM), which integrates spotting and recognition tasks. To address the first limitation, we propose a Self-Adaptive Pseudo-labeling Method (SAPM) that dynamically assigns pseudo-labels to micro-expression frames according to their actual proportion in the video sequence, thereby improving labeling accuracy. To address second limitation, we design a Multi-Scale Residual Channel Attention Network (MSRCAN) to effectively extract subtle micro-expression features. The MSRCAN comprises three modules: Multi-Scale Shared Network (MSSN), Spotting Network, and Recognition Network. The MSSN initially extracts micro-expression features by performing multi-scale feature extraction with Residual Connected Channel Attention Modules (RCCAM), which are then refined in the spotting and recognition networks. We conducted comprehensive experiments on three short video datasets (CASME II, SMIC-E-HS, SMIC-E-NIR) and two long video datasets (CAS(ME)2, SAMMLV). Experimental results show that our proposed method significantly outperforms existing methods, achieving an overall performance of 58.24%, a 19.62% improvement, and a <inline-formula> <tex-math>$1.51times $ </tex-math></inline-formula> gain over the baseline in terms of micro-expression analysis.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"221-233"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145771081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-17. DOI: 10.1109/TIP.2025.3642633
Xi Yang;Wenjiao Dong;Xian Wang;De Cheng;Nannan Wang
Video-based visible-infrared person re-identification (VVI-ReID) aims to match target pedestrians between visible and infrared videos and is widely applied in 24-hour surveillance systems. The key to VVI-ReID is learning modality-invariant and spatio-temporally invariant sequence-level representations that withstand challenges such as modality differences, spatio-temporal misalignment, and domain-shift noise. However, existing methods predominantly emphasize reducing the modality discrepancy while relatively neglecting temporal misalignment and domain-shift noise. To this end, this paper proposes a VVI-ReID framework called the Feature Alignment Network (FA-Net), which approaches the problem from the perspective of feature alignment to mitigate temporal misalignment. FA-Net comprises two main alignment modules: a Spatial-Temporal Alignment Module (STAM) and a Modality Distribution Constraint (MDC). STAM integrates global and local features to ensure that individuals’ spatial representations are aligned, and it also establishes temporal relationships by exploring inter-frame features to address cross-frame person feature matching. The MDC uses a symmetric distribution loss to align the feature distributions of the two modalities. In addition, the SAM Guidance Augmentation (SAM-GA) strategy transforms the image space of RGB and IR frames to provide more informative and less noisy frame information. Extensive experimental results demonstrate the effectiveness of the proposed method, which surpasses existing state-of-the-art methods. Our code will be available at: https://github.com/code/FANet
{"title":"FA-Net: A Feature Alignment Network for Video-Based Visible-Infrared Person Re-Identification","authors":"Xi Yang;Wenjiao Dong;Xian Wang;De Cheng;Nannan Wang","doi":"10.1109/TIP.2025.3642633","DOIUrl":"10.1109/TIP.2025.3642633","url":null,"abstract":"Video-based visible-infrared person re-identification (VVI-ReID) aims to match target pedestrians between visible and infrared videos, which is significantly applied in 24-hour surveillance systems. The key of VVI-ReID is to learn modality invariant and spatio-temporal invariant sequence-level representation to solve the challenges such as modality differences, spatio-temporal misalignment, and domain shift noise. However, existing methods predominantly emphasize on reducing modality discrepancy while relatively neglect temporal misalignment and domain shift noise reduction. To this end, this paper proposes a VVI-ReID framework called Feature Alignment Network (FA-Net) from the perspective of feature alignment, aiming to mitigate temporal misalignment. FA-Net comprises two main alignment modules: Spatial-Temporal Alignment Module (STAM) and Modality Distribution Constraint (MDC). STAM integrates global and local features to ensure individuals’ spatial representation alignment. Additionally, STAM also establishes temporal relationships by exploring inter-frame features to address cross-frame person feature matching. Furthermore, we introduce the Modality Distribution Constraint (MDC), which utilizes a symmetric distribution loss to align the distributions of features from different modalities. Besides, the SAM Guidance Augmentation (SAM-GA) strategy is designed to transform the image space of RGB and IR frames to provide more informative and less noisy frame information. Extensive experimental results demonstrate the effectiveness of the proposed method, surpassing existing state-of-the-art methods. Our code will be available at: <uri>https://github.com/code/FANet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8406-8420"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The substantial success of diffusion probabilistic models has prompted the study of their deployment in resource-limited scenarios. Pruning has proven effective in compressing discriminative models by exploiting the correlation between training losses and model performance. However, diffusion models generate high-quality images through an iterative process, which breaks this connection. To address this challenge, we propose a simple yet effective method, named NiCI-Pruning (Noise in Clean Image Pruning), for compressing diffusion models. NiCI-Pruning capitalizes on the noise the model predicts for clean image inputs, using it as the feature on which reconstruction losses are built. Taylor expansion is then applied to the proposed reconstruction loss to evaluate parameter importance effectively. Moreover, we propose an interval sampling strategy with a timestep-weighted schema, alleviating the risk of misleading information obtained at later timesteps. We provide comprehensive experimental results that affirm the superiority of the proposed approach. Notably, at equivalent pruning rates, our method reduces the increase in FID score by an average of 30.4% across five different datasets compared to the state-of-the-art diffusion pruning method. Our code and models are available at https://github.com/NUST-Machine-Intelligence-Laboratory/NiCI-Pruning
{"title":"NiCI-Pruning: Enhancing Diffusion Model Pruning via Noise in Clean Image Guidance","authors":"Junzhu Mao;Zeren Sun;Yazhou Yao;Tianfei Zhou;Liqiang Nie;Xiansheng Hua","doi":"10.1109/TIP.2025.3643138","DOIUrl":"10.1109/TIP.2025.3643138","url":null,"abstract":"The substantial successes achieved by diffusion probabilistic models have prompted the study of their employment in resource-limited scenarios. Pruning methods have been proven effective in compressing discriminative models relying on the correlation between training losses and model performances. However, diffusion models employ an iterative process for generating high-quality images, leading to a breakdown of such connections. To address this challenge, we propose a simple yet effective method, named NiCI-Pruning (Noise in Clean Image Pruning), for the compression of diffusion models. NiCI-Pruning capitalizes the noise predicted by the model based on clean image inputs, favoring it as a feature for establishing reconstruction losses. Accordingly, Taylor expansion is employed for the proposed reconstruction loss to evaluate the parameter importance effectively. Moreover, we propose an interval sampling strategy that incorporates a timestep-weighted schema, alleviating the risk of misleading information obtained at later timesteps. We provide comprehensive experimental results to affirm the superiority of our proposed approach. Notably, our method achieves a remarkable average reduction of 30.4% in FID score increase across five different datasets compared to the state-of-the-art diffusion pruning method at equivalent pruning rates. Our code and models have been made available at <uri>https://github.com/NUST-Machine-Intelligence-Laboratory/NiCI-Pruning</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8447-8460"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}