Artifacts remain a long-standing challenge in High Dynamic Range (HDR) reconstruction. Existing methods focus on model designs for artifact mitigation but ignore explicit detection and suppression strategies. Because artifacts lack clear boundaries, distinct shapes, and semantic consistency, and because no dedicated dataset for HDR artifacts exists, progress in direct artifact detection and recovery has been impeded. To bridge this gap, we propose a unified HDR reconstruction framework that integrates artifact detection and model optimization. First, we build the first HDR artifact dataset (HADataset), comprising 1,213 diverse multi-exposure Low Dynamic Range (LDR) image sets and 1,765 HDR image pairs with per-pixel artifact annotations. Second, we develop an effective HDR artifact detector (HADetector), a robust detection model capable of accurately localizing HDR reconstruction artifacts. HADetector plays two pivotal roles: (1) enhancing existing HDR reconstruction models through fine-tuning, and (2) serving as a non-reference image quality assessment (NR-IQA) metric, the Artifact Score (AS), which aligns closely with human visual perception for reliable quality evaluation. Extensive experiments validate the effectiveness and generalizability of our framework, including the HADataset, HADetector, fine-tuning paradigm, and AS metric. The code and datasets are available at: https://github.com/xinyueliii/hdr-artifact-detect-optimize
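The abstract does not spell out how the Artifact Score is computed or how the detector output drives fine-tuning, so the following is only a minimal PyTorch sketch of the general idea: a per-pixel artifact probability map averaged into a no-reference score, and the same map used to re-weight a reconstruction loss. The names `artifact_score`, `artifact_weighted_l1`, and the weighting factor `alpha` are illustrative assumptions, not the paper's definitions.

```python
import torch

def artifact_score(prob_map: torch.Tensor) -> torch.Tensor:
    """No-reference score: mean predicted artifact probability per image.

    prob_map: (B, 1, H, W) detector output after a sigmoid; lower is better.
    """
    return prob_map.mean(dim=(1, 2, 3))

def artifact_weighted_l1(pred_hdr, gt_hdr, prob_map, alpha=4.0):
    """Fine-tuning loss that up-weights pixels flagged as artifacts."""
    weight = 1.0 + alpha * prob_map            # (B, 1, H, W), broadcast over channels
    return (weight * (pred_hdr - gt_hdr).abs()).mean()

# toy usage with random tensors standing in for real data
pred = torch.rand(2, 3, 64, 64)
gt = torch.rand(2, 3, 64, 64)
probs = torch.sigmoid(torch.randn(2, 1, 64, 64))   # stand-in for a detector output
print(artifact_score(probs), artifact_weighted_l1(pred, gt, probs))
```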
{"title":"Rethinking Artifact Mitigation in HDR Reconstruction: From Detection to Optimization","authors":"Xinyue Li;Zhangkai Ni;Hang Wu;Wenhan Yang;Hanli Wang;Lianghua He;Sam Kwong","doi":"10.1109/TIP.2025.3642557","DOIUrl":"10.1109/TIP.2025.3642557","url":null,"abstract":"Artifact remains a long-standing challenge in High Dynamic Range (HDR) reconstruction. Existing methods focus on model designs for artifact mitigation but ignore explicit detection and suppression strategies. Because artifact lacks clear boundaries, distinct shapes, and semantic consistency, and there is no existing dedicated dataset for HDR artifact, progress in direct artifact detection and recovery is impeded. To bridge the gap, we propose a unified HDR reconstruction framework that integrates artifact detection and model optimization. Firstly, we build the first HDR artifact dataset (HADataset), comprising 1,213 diverse multi-exposure Low Dynamic Range (LDR) image sets and 1,765 HDR image pairs with per-pixel artifact annotations. Secondly, we develop an effective HDR artifact detector (HADetector), a robust artifact detection model capable of accurately localizing HDR reconstruction artifact. HADetector plays two pivotal roles: (1) enhancing existing HDR reconstruction models through fine-tuning, and (2) serving as a non-reference image quality assessment (NR-IQA) metric, the Artifact Score (AS), which aligns closely with human visual perception for reliable quality evaluation. Extensive experiments validate the effectiveness and generalizability of our framework, including the HADataset, HADetector, fine-tuning paradigm, and AS metric. The code and datasets are available at: <uri>https://github.com/xinyueliii/hdr-artifact-detect-optimize</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8435-8446"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-17 DOI: 10.1109/TIP.2025.3642612
Yujia Sun;Weisheng Dong;Peng Wu;Mingtao Feng;Tao Huang;Xin Li;Guangming Shi
Multimodal semantic segmentation has significantly advanced the field of semantic segmentation by integrating data from multiple sources. However, this task often encounters missing-modality scenarios due to challenges such as sensor failures or data transmission errors, which can result in substantial performance degradation. Existing approaches to addressing missing modalities predominantly involve training separate models tailored to specific missing scenarios, typically requiring considerable computational resources. In this paper, we propose a Hierarchical Adaptation framework to Restore Missing Modalities for Multimodal segmentation (HARM3), which enables frozen pretrained multimodal models to be directly applied to missing-modality semantic segmentation tasks with minimal parameter updates. Central to HARM3 is a text-instructed missing modality prompt module, which learns multimodal semantic knowledge by utilizing available modalities and textual instructions to generate prompts for the missing modalities. By incorporating a small set of trainable parameters, this module effectively facilitates knowledge transfer between high-resource domains and low-resource domains where missing modalities are more prevalent. In addition, to further enhance the model’s robustness and adaptability, we introduce adaptive perturbation training and an affine modality adapter. Extensive experimental results demonstrate the effectiveness and robustness of HARM3 across a variety of missing modality scenarios.
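As a rough sketch of what a text-instructed missing-modality prompt module could look like: the class `MissingModalityPrompt`, its dimensions, and the simple MLP design below are assumptions, since the abstract does not specify the actual module; the only grounded points are that prompts are generated from available modalities plus textual instructions and that the trainable parameter count stays small relative to the frozen backbone.

```python
import torch
import torch.nn as nn

class MissingModalityPrompt(nn.Module):
    """Toy prompt generator: available-modality feature + text instruction
    -> a small set of prompt tokens standing in for the missing modality."""
    def __init__(self, feat_dim=256, text_dim=512, n_prompts=8):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim + text_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, n_prompts * feat_dim),
        )
        self.n_prompts, self.feat_dim = n_prompts, feat_dim

    def forward(self, avail_feat, text_emb):
        # avail_feat: (B, feat_dim) pooled feature of an available modality
        # text_emb:   (B, text_dim) embedding of the textual instruction
        p = self.proj(torch.cat([avail_feat, text_emb], dim=-1))
        return p.view(-1, self.n_prompts, self.feat_dim)   # (B, n_prompts, feat_dim)

prompts = MissingModalityPrompt()(torch.randn(4, 256), torch.randn(4, 512))
print(prompts.shape)  # torch.Size([4, 8, 256]); prepended to the frozen backbone's tokens
```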
{"title":"Incomplete Modalities Restoration via Hierarchical Adaptation for Robust Multimodal Segmentation","authors":"Yujia Sun;Weisheng Dong;Peng Wu;Mingtao Feng;Tao Huang;Xin Li;Guangming Shi","doi":"10.1109/TIP.2025.3642612","DOIUrl":"10.1109/TIP.2025.3642612","url":null,"abstract":"Multimodal semantic segmentation has significantly advanced the field of semantic segmentation by integrating data from multiple sources. However, this task often encounters missing modality scenarios due to challenges such as sensor failures or data transmission errors, which can result in substantial performance degradation. Existing approaches to addressing missing modalities predominantly involve training separate models tailored to specific missing scenarios, typically requiring considerable computational resources. In this paper, we propose a Hierarchical Adaptation framework to Restore Missing Modalities for Multimodal segmentation (HARM3), which enables frozen pretrained multimodal models to be directly applied to missing-modality semantic segmentation tasks with minimal parameter updates. Central to HARM3 is a text-instructed missing modality prompt module, which learns multimodal semantic knowledge by utilizing available modalities and textual instructions to generate prompts for the missing modalities. By incorporating a small set of trainable parameters, this module effectively facilitates knowledge transfer between high-resource domains and low-resource domains where missing modalities are more prevalent. Besides, to further enhance the model’s robustness and adaptability, we introduce adaptive perturbation training and an affine modality adapter. Extensive experimental results demonstrate the effectiveness and robustness of HARM3 across a variety of missing modality scenarios.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8672-8683"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-17 DOI: 10.1109/TIP.2025.3642527
Jinxiu Zhang;Weidong Min;Jiahao Li;Qing Han
Micro-expressions can reveal genuine emotions that are not easily concealed, making them invaluable in fields such as psychotherapy and criminal interrogation. However, existing pseudo-labeling-based methods for micro-expression analysis have two major limitations. First, pseudo-labels generated by the sliding window do not account for the actual proportion of micro-expressions in the video, which leads to inaccurate labeling. Second, they predominantly focus on overall features, thereby neglecting subtle features. In this paper, we propose a micro-expression analysis method called the Spot-Then-Recognize Method (STRM), which integrates spotting and recognition tasks. To address the first limitation, we propose a Self-Adaptive Pseudo-labeling Method (SAPM) that dynamically assigns pseudo-labels to micro-expression frames according to their actual proportion in the video sequence, thereby improving labeling accuracy. To address the second limitation, we design a Multi-Scale Residual Channel Attention Network (MSRCAN) to effectively extract subtle micro-expression features. The MSRCAN comprises three modules: Multi-Scale Shared Network (MSSN), Spotting Network, and Recognition Network. The MSSN initially extracts micro-expression features by performing multi-scale feature extraction with Residual Connected Channel Attention Modules (RCCAM), which are then refined in the spotting and recognition networks. We conducted comprehensive experiments on three short video datasets (CASME II, SMIC-E-HS, SMIC-E-NIR) and two long video datasets (CAS(ME)2, SAMMLV). Experimental results show that our proposed method significantly outperforms existing methods, achieving an overall performance of 58.24%, a 19.62% improvement, and a $1.51\times$ gain over the baseline in terms of micro-expression analysis.
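The abstract only states that SAPM assigns pseudo-labels according to the actual proportion of micro-expression frames in the sequence; the minimal NumPy sketch below illustrates that idea, with the hypothetical helper `proportion_pseudo_labels` and a simple top-k selection standing in for whatever criterion the paper actually uses.

```python
import numpy as np

def proportion_pseudo_labels(frame_scores, me_ratio):
    """Assign pseudo-labels so that the number of positive frames matches
    an estimated micro-expression proportion of the sequence.

    frame_scores: (T,) per-frame evidence (e.g., motion magnitude or model confidence).
    me_ratio:     estimated fraction of micro-expression frames in the clip.
    """
    scores = np.asarray(frame_scores, dtype=float)
    k = max(1, int(round(me_ratio * len(scores))))
    labels = np.zeros(len(scores), dtype=int)
    labels[np.argsort(scores)[-k:]] = 1        # the k highest-evidence frames become positives
    return labels

scores = np.random.rand(100)                   # stand-in per-frame scores
print(proportion_pseudo_labels(scores, me_ratio=0.12).sum())  # ~12 positive frames
```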
{"title":"Micro-Expression Analysis Based on Self-Adaptive Pseudo-Labeling and Residual Connected Channel Attention Mechanisms","authors":"Jinxiu Zhang;Weidong Min;Jiahao Li;Qing Han","doi":"10.1109/TIP.2025.3642527","DOIUrl":"10.1109/TIP.2025.3642527","url":null,"abstract":"Micro-expressions can reveal genuine emotions that are not easily concealed, making them invaluable in fields such as psychotherapy and criminal interrogation. However, existing pseudo-labeling-based methods for micro-expression analysis have two major limitations. First, pseudo-labels generated by the sliding window do not account for the actual proportion of micro-expressions in the video, which leads to inaccurate labeling. Second, they predominantly focus on overall features, thereby neglecting subtle features. In this paper, we propose a micro-expression analysis method called Spot-Then-Recognize Method (STRM), which integrates spotting and recognition tasks. To address the first limitation, we propose a Self-Adaptive Pseudo-labeling Method (SAPM) that dynamically assigns pseudo-labels to micro-expression frames according to their actual proportion in the video sequence, thereby improving labeling accuracy. To address second limitation, we design a Multi-Scale Residual Channel Attention Network (MSRCAN) to effectively extract subtle micro-expression features. The MSRCAN comprises three modules: Multi-Scale Shared Network (MSSN), Spotting Network, and Recognition Network. The MSSN initially extracts micro-expression features by performing multi-scale feature extraction with Residual Connected Channel Attention Modules (RCCAM), which are then refined in the spotting and recognition networks. We conducted comprehensive experiments on three short video datasets (CASME II, SMIC-E-HS, SMIC-E-NIR) and two long video datasets (CAS(ME)2, SAMMLV). Experimental results show that our proposed method significantly outperforms existing methods, achieving an overall performance of 58.24%, a 19.62% improvement, and a <inline-formula> <tex-math>$1.51times $ </tex-math></inline-formula> gain over the baseline in terms of micro-expression analysis.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"221-233"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145771081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The substantial successes achieved by diffusion probabilistic models have prompted the study of their employment in resource-limited scenarios. Pruning methods have proven effective in compressing discriminative models by relying on the correlation between training losses and model performance. However, diffusion models employ an iterative process for generating high-quality images, which breaks down such connections. To address this challenge, we propose a simple yet effective method, named NiCI-Pruning (Noise in Clean Image Pruning), for the compression of diffusion models. NiCI-Pruning capitalizes on the noise predicted by the model from clean image inputs, using it as the feature for establishing reconstruction losses. Taylor expansion is then applied to the proposed reconstruction loss to evaluate parameter importance effectively. Moreover, we propose an interval sampling strategy that incorporates a timestep-weighted schema, alleviating the risk of misleading information obtained at later timesteps. We provide comprehensive experimental results to affirm the superiority of our proposed approach. Notably, our method achieves a remarkable 30.4% average reduction in the FID score increase across five different datasets compared to the state-of-the-art diffusion pruning method at equivalent pruning rates. Our code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/NiCI-Pruning
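A minimal sketch of first-order Taylor-based channel importance driven by a reconstruction loss on clean inputs, in the spirit of the description above; the reference prediction, the loss definition, and the function `taylor_channel_importance` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def taylor_channel_importance(model, loss):
    """First-order Taylor importance per conv output channel: |sum(w * dL/dw)|.

    loss: a scalar reconstruction loss built from the noise the model
    predicts on clean images (no forward diffusion applied).
    """
    loss.backward()
    scores = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d) and module.weight.grad is not None:
            w, g = module.weight, module.weight.grad
            scores[name] = (w * g).sum(dim=(1, 2, 3)).abs()   # one score per output channel
    return scores

# toy usage: a frozen reference prediction vs. the current model on clean inputs
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1),
                            torch.nn.SiLU(),
                            torch.nn.Conv2d(8, 3, 3, padding=1))
x_clean = torch.randn(4, 3, 32, 32)
with torch.no_grad():
    ref = model(x_clean) + 0.01 * torch.randn(4, 3, 32, 32)   # stand-in reference prediction
loss = torch.nn.functional.mse_loss(model(x_clean), ref)
print({k: v.shape for k, v in taylor_channel_importance(model, loss).items()})
```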
{"title":"NiCI-Pruning: Enhancing Diffusion Model Pruning via Noise in Clean Image Guidance","authors":"Junzhu Mao;Zeren Sun;Yazhou Yao;Tianfei Zhou;Liqiang Nie;Xiansheng Hua","doi":"10.1109/TIP.2025.3643138","DOIUrl":"10.1109/TIP.2025.3643138","url":null,"abstract":"The substantial successes achieved by diffusion probabilistic models have prompted the study of their employment in resource-limited scenarios. Pruning methods have been proven effective in compressing discriminative models relying on the correlation between training losses and model performances. However, diffusion models employ an iterative process for generating high-quality images, leading to a breakdown of such connections. To address this challenge, we propose a simple yet effective method, named NiCI-Pruning (Noise in Clean Image Pruning), for the compression of diffusion models. NiCI-Pruning capitalizes the noise predicted by the model based on clean image inputs, favoring it as a feature for establishing reconstruction losses. Accordingly, Taylor expansion is employed for the proposed reconstruction loss to evaluate the parameter importance effectively. Moreover, we propose an interval sampling strategy that incorporates a timestep-weighted schema, alleviating the risk of misleading information obtained at later timesteps. We provide comprehensive experimental results to affirm the superiority of our proposed approach. Notably, our method achieves a remarkable average reduction of 30.4% in FID score increase across five different datasets compared to the state-of-the-art diffusion pruning method at equivalent pruning rates. Our code and models have been made available at <uri>https://github.com/NUST-Machine-Intelligence-Laboratory/NiCI-Pruning</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8447-8460"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-17 DOI: 10.1109/TIP.2025.3643154
Sara Mandelli;Edoardo Daniele Cannas;Paolo Bestagini;Stefano Tebaldini;Stefano Tubaro
The vast accessibility of Synthetic Aperture Radar (SAR) images through online portals has propelled research across various fields. This widespread use and easy availability have unfortunately made SAR data susceptible to malicious alterations, such as local editing applied to the images to insert or conceal sensitive targets. To counter malicious manipulations, the forensic community has in recent years begun to investigate the SAR manipulation issue, proposing detectors that effectively localize tampering traces in amplitude images. Nonetheless, in this paper we demonstrate that an expert practitioner can exploit the complex nature of SAR data to obscure any signs of manipulation within a locally altered amplitude image. We refer to this approach as a counter-forensic attack. To conceal manipulation traces, the attacker can simulate a re-acquisition of the manipulated scene by the SAR system that initially generated the pristine image. In doing so, the attacker can obscure any evidence of manipulation, making it appear as if the image was legitimately produced by the system. This attack has unique features that make it both highly generalizable and relatively easy to apply. First, it is a black-box attack, meaning it is not designed to deceive a specific forensic detector. Furthermore, it does not require a training phase and is not based on adversarial operations. We assess the effectiveness of the proposed counter-forensic approach across diverse scenarios, examining various manipulation operations. The obtained results indicate that our devised attack successfully eliminates traces of manipulation, deceiving even the most advanced forensic detectors.
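One plausible toy reading of "simulating a re-acquisition" is to give the edited amplitude a random (speckle-like) phase and band-limit it with an idealized system transfer function, so the edited region inherits the resolution and spatial correlation of the rest of the image. The NumPy sketch below (the function `simulate_reacquisition`, the rectangular transfer function, and `bw_frac` are all assumptions) is only an illustration of that idea, not the authors' pipeline.

```python
import numpy as np

def simulate_reacquisition(amplitude, bw_frac=0.8, seed=0):
    """Toy re-acquisition: inject random phase, then band-limit with an
    idealized 2D system transfer function before taking the amplitude."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0.0, 2.0 * np.pi, amplitude.shape)
    slc = amplitude * np.exp(1j * phase)               # complex single-look-like image

    H, W = amplitude.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    system_tf = (np.abs(fy) < bw_frac / 2) & (np.abs(fx) < bw_frac / 2)

    spectrum = np.fft.fft2(slc) * system_tf            # idealized processed bandwidth
    return np.abs(np.fft.ifft2(spectrum))

edited = np.random.rayleigh(scale=1.0, size=(256, 256))   # stand-in edited amplitude image
print(simulate_reacquisition(edited).shape)
```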
{"title":"Hiding Local Manipulations on SAR Images: A Counter-Forensic Attack","authors":"Sara Mandelli;Edoardo Daniele Cannas;Paolo Bestagini;Stefano Tebaldini;Stefano Tubaro","doi":"10.1109/TIP.2025.3643154","DOIUrl":"10.1109/TIP.2025.3643154","url":null,"abstract":"The vast accessibility of Synthetic Aperture Radar (SAR) images through online portals has propelled the research across various fields. This widespread use and easy availability have unfortunately made SAR data susceptible to malicious alterations, such as local editing applied to the images for inserting or covering the presence of sensitive targets. To contrast malicious manipulations, in the last years the forensic community has begun to dig into the SAR manipulation issue, proposing detectors that effectively localize the tampering traces in amplitude images. Nonetheless, in this paper we demonstrate that an expert practitioner can exploit the complex nature of SAR data to obscure any signs of manipulation within a locally altered amplitude image. We refer to this approach as a counter-forensic attack. To achieve the concealment of manipulation traces, the attacker can simulate a re-acquisition of the manipulated scene by the SAR system that initially generated the pristine image. In doing so, the attacker can obscure any evidence of manipulation, making it appear as if the image was legitimately produced by the system. This attack has unique features that make it both highly generalizable and relatively easy to apply. First, it is a black-box attack, meaning it is not designed to deceive a specific forensic detector. Furthermore, it does not require a training phase and is not based on adversarial operations. We assess the effectiveness of the proposed counter-forensic approach across diverse scenarios, examining various manipulation operations. The obtained results indicate that our devised attack successfully eliminates traces of manipulation, deceiving even the most advanced forensic detectors.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8523-8536"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online image super-resolution (SR) services have been widely used in applications such as Remini and DeepAI. However, the exposure of plaintext images raises serious privacy concerns. While secure CNN inference techniques are employed to protect images in image classification, they are not applicable to the unique challenges posed by image SR: the output resolution is significantly higher than that of the input image. In this paper, we present a secure CNN inference scheme for image SR by employing a multiple ciphertext encapsulation method. We begin by designing fundamental homomorphic operations, including addition, multiplication, and rotation across ciphertexts. Recognizing that image SR typically involves an upsampling layer—unlike image classification—we propose a fast algorithm for secure upsampling. This technique leverages pre-weight block masking and cross-ciphertext rotation, resulting in a significant speedup compared to direct homomorphic upsampling. We then present an efficient batched homomorphic two-dimensional convolution method across ciphertexts, incorporating kernel rearrangement and merging strategies. We also design a polynomial activation function specifically optimized for image SR, further enhancing performance. Extensive experiments demonstrate that our HE-friendly SR network outperforms existing secure solutions, while the proposed multiple ciphertext encapsulation technique achieves at least a 2x improvement in both computational efficiency and memory usage.
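Since homomorphic encryption schemes evaluate only polynomials, a common stand-in for ReLU is a low-degree least-squares polynomial fit; the sketch below shows this generic idea only (the interval, the degree, and the helper `poly_act` are assumptions, not the activation actually optimized for SR in the paper).

```python
import numpy as np

# Fit a degree-4 polynomial to ReLU on [-6, 6]; under HE the activation must be
# polynomial, so this fit replaces the non-polynomial max(x, 0).
x = np.linspace(-6.0, 6.0, 2001)
coeffs = np.polyfit(x, np.maximum(x, 0.0), deg=4)      # least-squares fit

def poly_act(t):
    """HE-friendly stand-in for ReLU (evaluated with Horner's rule under HE)."""
    return np.polyval(coeffs, t)

t = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(np.round(poly_act(t), 3), np.maximum(t, 0.0))    # approximation vs. exact ReLU
```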
{"title":"Privacy-Preserving CNN Inference for Image Super-Resolution Cross Multiple Ciphertexts","authors":"Peijia Zheng;Donger Mo;Yufei Zhou;Xiangyu Gao;Xiaochun Cao;Jiwu Huang","doi":"10.1109/TIP.2025.3641310","DOIUrl":"10.1109/TIP.2025.3641310","url":null,"abstract":"Online image super-resolution (SR) services have been widely used in applications such as Remini and DeepAI. However, the exposure of plaintext images raises serious privacy concerns. While secure CNN inference techniques are employed to protect images in image classification, they are not applicable to the unique challenges posed by image SR: the output resolution is significantly higher than that of the input image. In this paper, we present a secure CNN inference scheme for image SR by employing a multiple ciphertext encapsulation method. We begin by designing fundamental homomorphic operations, including addition, multiplication, and rotation across ciphertexts. Recognizing that image SR typically involves an upsampling layer—unlike image classification—we propose a fast algorithm for secure upsampling. This technique leverages pre-weight block masking and cross-ciphertext rotation, resulting in a significant speedup compared to direct homomorphic upsampling. We then present an efficient batched homomorphic two-dimensional convolution method across ciphertexts, incorporating kernel rearrangement and merging strategies. We also design a polynomial activation function specifically optimized for image SR, further enhancing performance. Extensive experiments demonstrate that our HE-friendly SR network outperforms existing secure solutions, while the proposed multiple ciphertext encapsulation technique achieves at least a 2x improvement in both computational efficiency and memory usage.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8568-8582"},"PeriodicalIF":13.7,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-15 DOI: 10.1109/TIP.2025.3641303
Jun Wan;Min Gan;Lefei Zhang;Jie Zhou;Jun Liu;Bo Du;C. L. Philip Chen
Image captioning models based on CLIP visual features have developed rapidly and achieved remarkable results. However, existing models still struggle to produce descriptive and discriminative captions because they insufficiently exploit fine-grained visual cues and fail to model complex vision–language alignment. To address these limitations, we propose a Ranking Diffusion Transformer (RDT), which integrates a Ranking Visual Encoder (RVE) and a Ranking Loss (RL) for fine-grained image captioning. The RVE introduces a novel ranking attention mechanism that effectively mines diverse and discriminative visual information from CLIP features. Meanwhile, the RL leverages the ranking of generated caption quality as a global semantic supervisory signal, thereby enhancing the diffusion process and strengthening vision–language semantic alignment. We show that by coupling the RVE and RL within the proposed RDT, and by gradually adding and removing noise in the diffusion process, the model learns more discriminative visual features that are precisely aligned with the language features. Experimental results on popular benchmark datasets demonstrate that our proposed RDT surpasses existing state-of-the-art image captioning models in the literature. The code is publicly available at: https://github.com/junwan2014/RDT
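The abstract does not give the exact form of the Ranking Loss; a generic pairwise margin ranking loss over caption quality scores, sketched below with the hypothetical `caption_ranking_loss`, illustrates how a quality ordering can act as a global supervisory signal.

```python
import torch
import torch.nn.functional as F

def caption_ranking_loss(pred_scores, quality, margin=0.1):
    """Pairwise margin ranking loss: captions with higher measured quality
    (e.g., CIDEr) should receive higher model scores than lower-quality ones.

    pred_scores: (N,) scores the model assigns to N sampled captions of an image.
    quality:     (N,) external quality measurements used only to order the pairs.
    """
    i, j = torch.triu_indices(len(quality), len(quality), offset=1)
    sign = torch.sign(quality[i] - quality[j])          # +1 if caption i is better than j
    mask = sign != 0                                    # skip ties
    return F.margin_ranking_loss(pred_scores[i][mask], pred_scores[j][mask],
                                 sign[mask], margin=margin)

scores = torch.randn(5, requires_grad=True)             # stand-in model scores
cider = torch.tensor([0.9, 0.4, 0.7, 0.2, 0.5])         # stand-in quality measurements
print(caption_ranking_loss(scores, cider))
```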
{"title":"Fine-Grained Image Captioning by Ranking Diffusion Transformer","authors":"Jun Wan;Min Gan;Lefei Zhang;Jie Zhou;Jun Liu;Bo Du;C. L. Philip Chen","doi":"10.1109/TIP.2025.3641303","DOIUrl":"10.1109/TIP.2025.3641303","url":null,"abstract":"The CLIP visual feature-based image captioning models have developed rapidly and achieved remarkable results. However, existing models still struggle to produce descriptive and discriminative captions because they insufficiently exploit fine-grained visual cues and fail to model complex vision–language alignment. To address these limitations, we propose a Ranking Diffusion Transformer (RDT), which integrates a Ranking Visual Encoder (RVE) and a Ranking Loss (RL) for fine-grained image captioning. The RVE introduces a novel ranking attention mechanism that effectively mines diverse and discriminative visual information from CLIP features. Meanwhile, the RL leverages the ranking of generated caption quality as a global semantic supervisory signal, thereby enhancing the diffusion process and strengthening vision–language semantic alignment. We show that by collaborating RVE and RL via the novel RDT—and by gradually adding and removing noise in the diffusion process—more discriminative visual features are learned and precisely aligned with the language features. Experimental results on popular benchmark datasets demonstrate that our proposed RDT surpasses existing state-of-the-art image captioning models in the literature. The code is publicly available at: <uri>https://github.com/junwan2014/RDT</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8332-8344"},"PeriodicalIF":13.7,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145759515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current object detectors often suffer performance degradation when applied to cross-domain scenarios, particularly under challenging visual conditions such as nighttime scenes. This is primarily due to the I3 problems: Inadequate sampling of instance-level features, Indistinguishable feature representations across domains, and Inaccurate generation for identical category participation. To address these challenges, we propose a domain-adaptive detection framework that enables robust generalization across different visual domains without introducing any additional inference overhead. The framework comprises three key components. First, the centerness–category consistency sampler alleviates inadequate sampling by selecting representative instance-level features, while the paired centerness consistency loss enforces alignment between classification and localization. Second, VLM-based orthogonality enhancement leverages frozen vision–language encoders with an orthogonal projection loss to improve cross-domain feature distinguishability. Third, the hallucination feature generator synthesizes robust instance-level features for missing categories, ensuring balanced category participation across domains. Extensive experiments on multiple datasets covering various domain adaptation and generalization settings demonstrate that our method consistently outperforms state-of-the-art detectors, achieving up to 5.5 mAP improvement, with particularly strong gains in nighttime adaptation.
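As a sketch of what an orthogonal projection loss over instance features might look like: the function `orthogonality_loss` and the choice of penalizing cross-category cosine similarity are assumptions based on the general technique, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(feats, labels):
    """Push L2-normalized instance features of different categories toward
    orthogonality (zero cosine similarity) to improve separability.

    feats:  (N, D) instance-level features (e.g., aligned to a frozen VLM space).
    labels: (N,)   category indices.
    """
    f = F.normalize(feats, dim=1)
    cos = f @ f.t()                                     # (N, N) pairwise cosine similarities
    diff_class = labels[:, None] != labels[None, :]     # mask of cross-category pairs
    return cos[diff_class].abs().mean()

feats = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 4, (16,))
print(orthogonality_loss(feats, labels))
```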
{"title":"Vision–Language Models Empowered Nighttime Object Detection With Consistency Sampler and Hallucination Feature Generator","authors":"Lihuo He;Junjie Ke;Zhenghao Wang;Jie Li;Kai Zhou;Qi Wang;Xinbo Gao","doi":"10.1109/TIP.2025.3641316","DOIUrl":"10.1109/TIP.2025.3641316","url":null,"abstract":"Current object detectors often suffer performance degradation when applied to cross-domain scenarios, particularly under challenging visual conditions such as nighttime scenes. This is primarily due to the I3 problems: Inadequate sampling of instance-level features, Indistinguishable feature representation across domains and Inaccurate generation for identical category participation. To address these challenges, we propose a domain-adaptive detection framework that enables robust generalization across different visual domains without introducing any additional inference overhead. The framework comprises three key components. Specifically, the centerness–category consistency sampler alleviates inadequate sampling by selecting representative instance-level features, while the paired centerness consistency loss enforces alignment between classification and localization. Second, VLM-based orthogonality enhancement leverages frozen vision–language encoders with an orthogonal projection loss to improve cross-domain feature distinguishability. Third, hallucination feature generator synthesizes robust instance-level features for missing categories, ensuring balanced category participation across domains. Extensive experiments on multiple datasets covering various domain adaptation and generalization settings demonstrate that our method consistently outperforms state-of-the-art detectors, achieving up to 5.5 mAP improvement, with particularly strong gains in nighttime adaptation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8345-8360"},"PeriodicalIF":13.7,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145759517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Vision Transformer (ViT) has achieved remarkable success in computer vision due to its powerful token mixer, which effectively captures global dependencies among all tokens. However, the quadratic complexity of standard self-attention with respect to the number of tokens severely hampers its computational efficiency in practical deployment. Although recent hybrid approaches have sought to combine the strengths of convolutions and self-attention to improve the performance–efficiency trade-off, the costly pairwise token interactions and heavy matrix operations in conventional self-attention remain a critical bottleneck. To overcome this limitation, we introduce S2AFormer, an efficient Vision Transformer architecture built around a novel Strip Self-Attention (SSA) mechanism. Our design incorporates lightweight yet effective Hybrid Perception Blocks (HPBs) that seamlessly fuse the local inductive biases of CNNs with the global modeling capability of Transformer-style attention. The core innovation of SSA lies in simultaneously reducing the spatial resolution of the key ($K$) and value ($V$) tensors while compressing the channel dimension of the query ($Q$) and key ($K$) tensors. This joint spatial-and-channel compression dramatically lowers computational cost without sacrificing representational power, achieving an excellent balance between accuracy and efficiency. We extensively evaluate S2AFormer on a wide range of vision tasks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection/instance segmentation (COCO). Experimental results consistently show that S2AFormer delivers substantial accuracy improvements together with superior inference speed and throughput across both GPU and non-GPU platforms, establishing it as a highly competitive solution in the landscape of efficient Vision Transformers.
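A single-head sketch of the compression pattern described above, spatially downsampling $K$/$V$ and shrinking the channel dimension of $Q$/$K$ before attention; `StripLikeAttention`, its projection layout, and the convolutional spatial reduction are assumptions and omit the strip-shaped operations and HPB structure of the actual model.

```python
import torch
import torch.nn as nn

class StripLikeAttention(nn.Module):
    """Sketch of the idea described in the abstract: spatially downsample K/V
    and compress the channel dimension of Q/K before computing attention."""
    def __init__(self, dim=64, qk_dim=16, sr_ratio=4):
        super().__init__()
        self.q = nn.Linear(dim, qk_dim)
        self.k = nn.Linear(dim, qk_dim)
        self.v = nn.Linear(dim, dim)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.scale = qk_dim ** -0.5

    def forward(self, x, H, W):
        # x: (B, H*W, dim) token sequence
        B, N, C = x.shape
        kv_in = self.sr(x.transpose(1, 2).reshape(B, C, H, W))      # spatial reduction of K/V source
        kv_in = kv_in.flatten(2).transpose(1, 2)                    # (B, M, C) with M << N
        q, k, v = self.q(x), self.k(kv_in), self.v(kv_in)           # Q/K channel-compressed to qk_dim
        attn = (q @ k.transpose(-2, -1)) * self.scale               # (B, N, M)
        return attn.softmax(dim=-1) @ v                             # (B, N, dim)

x = torch.randn(2, 32 * 32, 64)
print(StripLikeAttention()(x, 32, 32).shape)  # torch.Size([2, 1024, 64])
```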
{"title":"S2AFormer: Strip Self-Attention for Efficient Vision Transformer","authors":"Guoan Xu;Wenfeng Huang;Wenjing Jia;Jiamao Li;Guangwei Gao;Guo-Jun Qi","doi":"10.1109/TIP.2025.3639919","DOIUrl":"10.1109/TIP.2025.3639919","url":null,"abstract":"The Vision Transformer (ViT) has achieved remarkable success in computer vision due to its powerful token mixer, which effectively captures global dependencies among all tokens. However, the quadratic complexity of standard self-attention with respect to the number of tokens severely hampers its computational efficiency in practical deployment. Although recent hybrid approaches have sought to combine the strengths of convolutions and self-attention to improve the performance–efficiency trade-off, the costly pairwise token interactions and heavy matrix operations in conventional self-attention remain a critical bottleneck. To overcome this limitation, we introduce S2AFormer, an efficient Vision Transformer architecture built around a novel Strip Self-Attention (SSA) mechanism. Our design incorporates lightweight yet effective Hybrid Perception Blocks (HPBs) that seamlessly fuse the local inductive biases of CNNs with the global modeling capability of Transformer-style attention. The core innovation of SSA lies in simultaneously reducing the spatial resolution of the key (<inline-formula> <tex-math>$K$ </tex-math></inline-formula>) and value (<inline-formula> <tex-math>$V$ </tex-math></inline-formula>) tensors while compressing the channel dimension of the query (<inline-formula> <tex-math>$Q$ </tex-math></inline-formula>) and key (<inline-formula> <tex-math>$K$ </tex-math></inline-formula>) tensors. This joint spatial-and-channel compression dramatically lowers computational cost without sacrificing representational power, achieving an excellent balance between accuracy and efficiency. We extensively evaluate S2AFormer on a wide range of vision tasks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection/instance segmentation (COCO). Experimental results consistently show that S2AFormer delivers substantial accuracy improvements together with superior inference speed and throughput across both GPU and non-GPU platforms, establishing it as a highly competitive solution in the landscape of efficient Vision Transformers.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8243-8256"},"PeriodicalIF":13.7,"publicationDate":"2025-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145728828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-11 DOI: 10.1109/TIP.2025.3641052
Jiachuan Yu;Han Sun;Yuankai Zhou;Xiaowei Jiang
This paper presents a robust, decoupled approach to camera distortion correction using a rational function model (RFM), designed to address challenges in accuracy and flexibility within precision-critical applications. Camera distortion is a pervasive issue in fields such as medical imaging, robotics, and 3D reconstruction, where high fidelity and geometric accuracy are crucial. Traditional distortion correction methods rely on radial-symmetry-based models, which have limited precision under tangential distortion and require nonlinear optimization. In contrast, general models do not rely on radially symmetric geometry and are theoretically generalizable to various sources of distortion. A gap exists, however, between the theoretical precision advantage of the RFM and its practical applicability in real-world scenarios. This gap arises from uncertainties regarding the model’s robustness to noise, the impact of sparse sample distributions, and its generalizability outside the training sample range. In this paper, we provide a mathematical interpretation of why the RFM is suitable for the distortion correction problem through sensitivity analysis. The precision and robustness of the RFM are evaluated through synthetic and real-world experiments, considering distortion level, noise level, and sample distribution. Moreover, a practical and accurate decoupled distortion correction method is proposed that uses just a single captured image of a chessboard pattern. The correction performance is compared with the current state of the art via camera calibration, and experimental results indicate that more precise distortion correction can enhance the overall accuracy of camera calibration. In summary, this decoupled RFM-based distortion correction approach provides a flexible, high-precision solution for applications requiring minimal calibration steps and reliable geometric accuracy, establishing a foundation for distortion-free imaging and simplified camera models in precision-driven computer vision tasks.
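A rational function model maps distorted to undistorted coordinates as a ratio of polynomials and can be fit by linear least squares once the denominator's constant term is fixed; the NumPy sketch below (degree-2 basis, per-coordinate fit, helpers `fit_rational` and `apply_rational`) illustrates this general construction under those assumptions, not the paper's exact model.

```python
import numpy as np

def _monomials(x, y):
    # degree-2 basis: 1, x, y, x^2, x*y, y^2
    return np.stack([np.ones_like(x), x, y, x * x, x * y, y * y], axis=1)

def fit_rational(xd, yd, u):
    """Fit u ~= P(xd, yd) / Q(xd, yd) with Q's constant term fixed to 1.

    Linearized as  P - u * (Q - 1) = u  and solved by ordinary least squares.
    """
    M = _monomials(xd, yd)
    A = np.hstack([M, -u[:, None] * M[:, 1:]])   # unknowns: 6 numerator + 5 denominator coeffs
    coef, *_ = np.linalg.lstsq(A, u, rcond=None)
    return coef[:6], np.concatenate([[1.0], coef[6:]])

def apply_rational(p, q, xd, yd):
    M = _monomials(xd, yd)
    return (M @ p) / (M @ q)

# toy correspondences: distorted chessboard corners -> undistorted x-coordinates
rng = np.random.default_rng(0)
xd, yd = rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200)
r2 = xd**2 + yd**2
xu = xd * (1 + 0.1 * r2)                         # synthetic radial distortion
p, q = fit_rational(xd, yd, xu)
print(np.max(np.abs(apply_rational(p, q, xd, yd) - xu)))   # max fit residual over the samples
```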
{"title":"High-Precision Camera Distortion Correction: A Decoupled Approach With Rational Functions","authors":"Jiachuan Yu;Han Sun;Yuankai Zhou;Xiaowei Jiang","doi":"10.1109/TIP.2025.3641052","DOIUrl":"10.1109/TIP.2025.3641052","url":null,"abstract":"This paper presents a robust, decoupled approach to camera distortion correction using a rational function model (RFM), designed to address challenges in accuracy and flexibility within precision-critical applications. Camera distortion is a pervasive issue in fields such as medical imaging, robotics, and 3D reconstruction, where high fidelity and geometric accuracy are crucial. Traditional distortion correction methods rely on radial-symmetry-based models, which have limited precision under tangential distortion and require nonlinear optimization. In contrast, general models do not rely on radial symmetry geometry and are theoretically generalizable to various sources of distortion. There exists a gap between the theoretical precision advantage of the Rational Function Model (RFM) and its practical applicability in real-world scenarios. This gap arises from uncertainties regarding the model’s robustness to noise, the impact of sparse sample distributions, and its generalizability out of the training sample range. In this paper, we provide a mathematical interpretation of how RFM is suitable for the distortion correction problem through sensitivity analysis. The precision and robustness of RFM are evaluated through synthetic and real-world experiments, considering distortion level, noise level, and sample distribution. Moreover, a practical and accurate decoupled distortion correction method is proposed using just a single captured image of a chessboard pattern. The correction performance is compared with the current state-of-the-art using camera calibration, and experimental results indicate that more precise distortion correction can enhance the overall accuracy of camera calibration. In summary, this decoupled RFM-based distortion correction approach provides a flexible, high-precision solution for applications requiring minimal calibration steps and reliable geometric accuracy, establishing a foundation for distortion-free imaging and simplified camera models in precision-driven computer vision tasks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"290-304"},"PeriodicalIF":13.7,"publicationDate":"2025-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145728417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}