Pub Date : 2025-12-22  DOI: 10.1109/TIP.2025.3644140
Zheren Fu;Zhendong Mao;Lei Zhang;Yongdong Zhang
Multimodal Large Language Models (MLLMs) exhibit impressive performance across vision-language tasks, but still face the challenge of hallucination, where the generated text is factually inconsistent with the visual input. Existing mitigation methods focus on surface symptoms of hallucination and rely heavily on post-hoc corrections, extensive data curation, or costly inference schemes. In this work, we identify two key factors of MLLM hallucination: Insufficient Visual Context, where ambiguous visual contexts lead to language speculation, and Progressive Textual Drift, where model attention strays from visual inputs in longer responses. To address these problems, we propose a novel Complementary Visual Grounding (CVG) framework. CVG exploits the intrinsic architecture of MLLMs, without requiring any external tools, models, or additional data. CVG first disentangles the visual context into two complementary branches based on query relevance, then maintains steadfast visual grounding during auto-regressive generation. Finally, it contrasts the output distributions of the two branches to produce a faithful response. Extensive experiments on various hallucination and general benchmarks demonstrate that CVG achieves state-of-the-art performance across MLLM architectures and scales.
{"title":"Boosting Faithful Multi-Modal LLMs via Complementary Visual Grounding","authors":"Zheren Fu;Zhendong Mao;Lei Zhang;Yongdong Zhang","doi":"10.1109/TIP.2025.3644140","DOIUrl":"10.1109/TIP.2025.3644140","url":null,"abstract":"Multimodal Large Language Models (MLLMs) exhibit impressive performance across vision-language tasks, but still face the hallucination challenges, where generated texts are factually inconsistent with visual input. Existing mitigation methods focus on surface symptoms of hallucination and heavily rely on post-hoc corrections, extensive data curation, or costly inference schemes. In this work, we identify two key factors of MLLM hallucination: Insufficient Visual Context, where ambiguous visual contexts lead to language speculation, and Progressive Textual Drift, where model attention strays from visual inputs in longer responses. To address these problems, we propose a novel Complementary Visual Grounding (CVG) framework. CVG exploits the intrinsic architecture of MLLMs, without requiring any external tools, models, or additional data. CVG first disentangles visual context into two complementary branches based on query relevance, then maintains steadfast visual grounding during the auto-regressive generation. Finally, it contrasts the output distributions of two branches to produce a faithful response. Extensive experiments on various hallucination and general benchmarks demonstrate that CVG achieves state-of-the-art performances across MLLM architectures and scales.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8641-8655"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-22  DOI: 10.1109/TIP.2025.3644793
Kareth M. León-López;Abderrahim Halimi;Jean-Yves Tourneret;Herwig Wendt
Multifractal analysis (MFA) provides a framework for the global characterization of image textures by describing the spatial fluctuations of their local regularity through the multifractal spectrum. Several works have demonstrated the value of MFA for describing homogeneous textures in images. Nevertheless, natural images can be composed of several textures and, in turn, of several sets of multifractal properties associated with those textures. This paper introduces an unsupervised Bayesian multifractal segmentation method that models and segments multifractal textures by jointly estimating the multifractal parameters and the pixel-level labels of an image. To this end, a computationally and statistically efficient multifractal parameter estimation model for wavelet leaders is first developed, allowing different multifractality parameters for different regions of an image. Then, a multiscale Potts Markov random field is introduced as a prior to model the inherent spatial and scale correlations (referred to as cross-scale correlations) between the labels of the wavelet leaders. Finally, a Gibbs sampling methodology is used to draw samples from the posterior distribution of the unknown model parameters. Numerical experiments are conducted on synthetic multifractal images to evaluate the performance of the proposed segmentation approach. The proposed method achieves superior performance compared to traditional unsupervised segmentation techniques as well as modern deep learning-based approaches, showing its effectiveness for multifractal image segmentation.
{"title":"Bayesian Multifractal Image Segmentation","authors":"Kareth M. León-López;Abderrahim Halimi;Jean-Yves Tourneret;Herwig Wendt","doi":"10.1109/TIP.2025.3644793","DOIUrl":"10.1109/TIP.2025.3644793","url":null,"abstract":"Multifractal analysis (MFA) provides a framework for the global characterization of image textures by describing the spatial fluctuations of their local regularity based on the multifractal spectrum. Several works have shown the interest of using MFA for the description of homogeneous textures in images. Nevertheless, natural images can be composed of several textures and, in turn, multifractal properties associated with those textures. This paper introduces an unsupervised Bayesian multifractal segmentation method to model and segment multifractal textures by jointly estimating the multifractal parameters and labels on images, at the pixel-level. For this, a computationally and statistically efficient multifractal parameter estimation model for wavelet leaders is firstly developed, defining different multifractality parameters for different regions of an image. Then, a multiscale Potts Markov random field is introduced as a prior to model the inherent spatial and scale correlations (referred to as cross-scale correlations) between the labels of the wavelet leaders. A Gibbs sampling methodology is finally used to draw samples from the posterior distribution of the unknown model parameters. Numerical experiments are conducted on synthetic multifractal images to evaluate the performance of the proposed segmentation approach. The proposed method achieves superior performance compared to traditional unsupervised segmentation techniques as well as modern deep learning-based approaches, showing its effectiveness for multifractal image segmentation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8500-8510"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A low-light colonoscopy video enhancement method is needed as poor illumination in colonoscopy can hinder accurate disease diagnosis and adversely affect surgical procedures. Existing low-light video enhancement methods usually apply a frame-by-frame enhancement strategy without considering the temporal correlation between frames, which often causes a flickering problem. In addition, most methods are designed for endoscopic devices with fixed imaging styles and cannot be easily adapted to different devices. In this paper, we propose a Style-Guided Network (SGNet) for unpaired Low-Light Colonoscopy Video Enhancement (LLCVE). Given that collecting content-consistent paired videos is difficult, SGNet adopts a CycleGAN-based framework to convert low-light videos to normal-light videos, in which a Temporal Compensation (TC) module and a Style Guidance (SG) module are proposed to alleviate the flickering problem and achieve flexible style transfer, respectively. The TC module compensates for a low-light frame by learning the correlated feature of its adjacent frames, thereby improving the temporal smoothness of the enhanced video. The SG module encodes the text of the imaging style and adaptively explores its intrinsic relationships with video features to obtain style representations, which are then used to guide the subsequent enhancement process. Extensive experiments on a curated database show that SGNet achieves promising performance on the LLCVE task, outperforming state-of-the-art methods in both quantitative metrics and visual quality.
{"title":"SGNet: Style-Guided Network With Temporal Compensation for Unpaired Low-Light Colonoscopy Video Enhancement","authors":"Guanghui Yue;Lixin Zhang;Wanqing Liu;Jingfeng Du;Tianwei Zhou;Hanhe Lin;Qiuping Jiang;Wenqi Ren","doi":"10.1109/TIP.2025.3644172","DOIUrl":"10.1109/TIP.2025.3644172","url":null,"abstract":"A low-light colonoscopy video enhancement method is needed as poor illumination in colonoscopy can hinder accurate disease diagnosis and adversely affect surgical procedures. Existing low-light video enhancement methods usually apply a frame-by-frame enhancement strategy without considering the temporal correlation between them, which often causes a flickering problem. In addition, most methods are designed for endoscopic devices with fixed imaging styles and cannot be easily adapted to different devices. In this paper, we propose a Style-Guided Network (SGNet) for unpaired Low-Light Colonoscopy Video Enhancement (LLCVE). Given that collecting content-consistent paired videos is difficult, SGNet adopts a CycleGAN-based framework to convert low-light videos to normal-light videos, in which a Temporal Compensation (TC) module and a Style Guidance (SG) module are proposed to alleviate the flickering problem and achieve flexible style transfer, respectively. The TC module compensates for a low-light frame by learning the correlated feature of its adjacent frames, thereby improving the temporal smoothness of the enhanced video. The SG module encodes the text of the imaging style and adaptively explores its intrinsic relationships with video features to obtain style representations, which are then used to guide the subsequent enhancement process. Extensive experiments on a curated database show that SGNet achieves promising performance on the LLCVE task, outperforming state-of-the-art methods in both quantitative metrics and visual quality.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"234-246"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose A3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model updates during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at https://github.com/HiLab-git/A3-TTA
{"title":"A3-TTA: Adaptive Anchor Alignment Test-Time Adaptation for Image Segmentation","authors":"Jianghao Wu;Xiangde Luo;Yubo Zhou;Lianming Wu;Guotai Wang;Shaoting Zhang","doi":"10.1109/TIP.2025.3644789","DOIUrl":"10.1109/TIP.2025.3644789","url":null,"abstract":"Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose A3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at <uri>https://github.com/HiLab-git/A3-TTA</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8511-8522"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. These methods encode the detected 2D pose and the unknown depth in an entangled feature space, explicitly introducing depth uncertainty into the detected 2D pose and thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is first refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: 1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, thereby reducing the explicit influence of uncertain depth features on 2D pose features. 2) A cross-expert knowledge aggregation module that aggregates cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
{"title":"PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation","authors":"Mengyuan Liu;Jiajie Liu;Jinyan Zhang;Wenhao Li;Junsong Yuan","doi":"10.1109/TIP.2025.3644785","DOIUrl":"10.1109/TIP.2025.3644785","url":null,"abstract":"The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: 1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. 2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8537-8551"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medical image restoration (MedIR) aims to recover high-quality images from degraded inputs, yet faces unique challenges from physics-driven degradations and multi-modal task interference. While existing all-in-one methods handle natural image degradations well, they struggle with medical scenarios due to limited degradation perception and suboptimal multi-task optimization. In response, we introduce DaPT, a Degradation-aware Prompted Transformer, which integrates dynamic prompt learning and modular expert mining for unified MedIR. First, DaPT introduces spatially compact prompts with optimal transport regularization, amplifying inter-prompt differences to capture diverse degradation patterns. Second, a mixture of experts dynamically routes inputs to specialized modules via prompt guidance, resolving task conflicts while reducing computational overhead. The synergy of prompt learning and expert mining further enables robust restoration across multi-modal medical data, offering a practical solution for clinical imaging. Extensive experiments across multiple modalities (MRI, CT, PET) and diverse degradations, covering both in-distribution and out-of-distribution scenarios, demonstrate that DaPT consistently outperforms state-of-the-art methods and generalizes reliably to unseen settings, underscoring its robustness, effectiveness, and clinical practicality. The source code will be released at https://github.com/weijinbao1998/DaPT
{"title":"Degradation-Aware Prompted Transformer for Unified Medical Image Restoration","authors":"Jinbao Wei;Gang Yang;Zhijie Wang;Shimin Tao;Aiping Liu;Xun Chen","doi":"10.1109/TIP.2025.3644795","DOIUrl":"10.1109/TIP.2025.3644795","url":null,"abstract":"Medical image restoration (MedIR) aims to recover high-quality images from degraded inputs, yet faces unique challenges from physics-driven degradations and multi-modal task interference. While existing all-in-one methods handle natural image degradations well, they struggle with medical scenarios due to limited degradation perception and suboptimal multi-task optimization. In response, we introduce DaPT, a Degradation-aware Prompted Transformer, which integrates dynamic prompt learning and modular expert mining for unified MedIR. First, DaPT introduces spatially compact prompts with optimal transport regularization, amplifying inter-prompt differences to capture diverse degradation patterns. Second, a mixture of experts dynamically routes inputs to specialized modules via prompt guidance, resolving task conflicts while reducing computational overhead. The synergy of prompt learning and expert mining further enables robust restoration across multi-modal medical data, offering a practical solution for clinical imaging. Extensive experiments across multiple modalities (MRI, CT, PET) and diverse degradations, covering both in-distribution and out-of-distribution scenarios, demonstrate that DaPT consistently outperforms state-of-the-art methods and generalizes reliably to unseen settings, underscoring its robustness, effectiveness, and clinical practicality. The source code will be released at <uri>https://github.com/weijinbao1998/DaPT</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8583-8598"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-22  DOI: 10.1109/TIP.2025.3644167
Yuxin Feng;Jufeng Li;Tao Huang;Fangfang Wu;Yakun Ju;Chunxu Li;Weisheng Dong;Alex C. Kot
Current deep learning-based methods for remote sensing image dehazing have developed rapidly, yet they still commonly struggle to simultaneously preserve fine texture details and restore accurate colors. The fundamental reason lies in the insufficient modeling of high-frequency information that captures structural details, as well as the lack of effective constraints for color restoration. To address the insufficient modeling of global high-frequency information, we first develop an omni-directional high-frequency feature inpainting mechanism that leverages the wavelet transform to extract multi-directional high-frequency components. While maintaining the advantage of linear complexity, it models global long-range texture dependencies through cross-frequency perception. Then, to further strengthen local high-frequency representation, we design a high-frequency prompt attention module that dynamically injects wavelet-domain optimized high-frequency features as cross-level guidance signals, significantly enhancing the model’s capability in edge sharpness restoration and texture detail reconstruction. Further, to alleviate the problem of inaccurate color restoration, we propose a color contrast loss function based on the HSV color space, which explicitly models the statistical distribution differences of brightness and saturation in hazy regions, guiding the model to generate dehazed images with consistent colors and natural visual appearance. Finally, extensive experiments on multiple benchmark datasets demonstrate that the proposed method outperforms existing approaches in both texture detail restoration and color consistency. Further results and code are available at: https://github.com/fyxnl/C4RSD
{"title":"Cross-Frequency Attention and Color Contrast Constraint for Remote Sensing Dehazing","authors":"Yuxin Feng;Jufeng Li;Tao Huang;Fangfang Wu;Yakun Ju;Chunxu Li;Weisheng Dong;Alex C. Kot","doi":"10.1109/TIP.2025.3644167","DOIUrl":"10.1109/TIP.2025.3644167","url":null,"abstract":"Current deep learning-based methods for remote sensing image dehazing have developed rapidly, yet they still commonly struggle to simultaneously preserve fine texture details and restore accurate colors. The fundamental reason lies in the insufficient modeling of high-frequency information that captures structural details, as well as the lack of effective constraints for color restoration. To address the insufficient modeling of global high-frequency information, we first develop an omni-directional high-frequency feature in painting mechanism that leverages the wavelet transform to extract multi-directional high-frequency components. While maintaining the advantage of linear complexity, it models global long-range texture dependencies through cross-frequency perception. Then, to further strengthen local high-frequency representation, we design a high-frequency prompt attention module that dynamically injects wavelet-domain optimized high-frequency features as cross-level guidance signals, significantly enhancing the model’s capability in edge sharpness restoration and texture detail reconstruction. Further, to alleviate the problem of inaccurate color restoration, we propose a color contrast loss function based on the HSV color space, which explicitly models the statistical distribution differences of brightness and saturation in hazy regions, guiding the model to generate dehazed images with consistent colors and natural visual appearance. Finally, extensive experiments on multiple benchmark datasets demonstrate that the proposed method outperforms existing approaches in both texture detail restoration and color consistency. Further results and code are available at: <uri>https://github.com/fyxnl/C4RSD</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8552-8567"},"PeriodicalIF":13.7,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145807709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-19  DOI: 10.1109/TIP.2025.3609135
Bowen Ma;Tong Jia;Hao Wang;Dongyue Chen
Threat Image Projection (TIP) is a convenient and effective means to expand X-ray baggage image datasets, which is essential for training both security personnel and computer-aided screening systems. Existing methods fall primarily into two categories: X-ray imaging principle-based methods and GAN-based generative methods. The former casts prohibited-item acquisition and projection as two separate steps and rarely considers the style consistency between the source prohibited items and target X-ray images from different datasets, making it less flexible and reliable for practical applications. Although GAN-based methods can directly generate visually consistent prohibited items on target images, they suffer from unstable training and a lack of interpretability, which significantly impact the quality of the generated items. To overcome these limitations, we present a conceptually simple, flexible and unsupervised end-to-end TIP framework, termed Meta-TIP, which superimposes the prohibited item distilled from the source image onto the target image in a style-adaptive manner. Specifically, Meta-TIP applies three main innovations: 1) a pure prohibited item is reconstructed from a cluttered source image with a novel foreground-background contrastive loss; 2) a material-aware style-adaptive projection module learns two modulation parameters, based on the style of objects of similar material in the target image, to control the appearance of prohibited items; 3) a novel logarithmic-form loss is designed based on the principle of TIP to optimize synthetic results in an unsupervised manner. We comprehensively verify the authenticity and training effect of the synthetic X-ray images on four public datasets, i.e., the SIXray, OPIXray, PIXray, and PIDray datasets, and the results confirm that our framework can flexibly generate highly realistic synthetic images without the aforementioned limitations.
{"title":"Meta-TIP: An Unsupervised End-to-End Fusion Network for Multi-Dataset Style-Adaptive Threat Image Projection","authors":"Bowen Ma;Tong Jia;Hao Wang;Dongyue Chen","doi":"10.1109/TIP.2025.3609135","DOIUrl":"https://doi.org/10.1109/TIP.2025.3609135","url":null,"abstract":"Threat Image Projection (TIP) is a convenient and effective means to expand X-ray baggage images, which is essential for training both security personnel and computer-aided screening systems. Existing methods are primarily divided into two categories: X-ray imaging principle-based methods and GAN-based generative methods. The former cast prohibited items acquisition and projection as two individual steps and rarely consider the style consistency between the source prohibited items and target X-ray images from different datasets, making them less flexible and reliable for practical applications. Although GAN-based methods can directly generate visually consistent prohibited items on target images, they suffer from unstable training and lack of interpretability, which significantly impact the quality of the generated items. To overcome these limitations, we present a conceptually simple, flexible and unsupervised end-to-end TIP framework, termed as Meta-TIP, which superimposes the prohibited item distilled from the source image onto the target image in a style-adaptive manner. Specifically, Meta-TIP mainly applies three innovations: 1) reconstruct a pure prohibited item from a cluttered source image with a novel foreground-background contrastive loss; 2) a material-aware style-adaptive projection module learns two modulation parameters pertinently based on the style of similar material objects in the target image to control the appearance of prohibited items; 3) a novel logarithmic form loss is well-designed based on the principle of TIP to optimize synthetic results in an unsupervised manner. We comprehensively verify the authenticity and training effect of the synthetic X-ray images on four public datasets, i.e., SIXray, OPIXray, PIXray, and PIDray dataset, and the results confirm that our framework can flexibly generate very realistic synthetic images without any limitations.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8317-8331"},"PeriodicalIF":13.7,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid development of face forgery technology, fake faces have become rampant, threatening security and authenticity in many fields, which makes face forgery detection an important research problem. Existing detection methods fall short in the comprehensiveness of their feature extraction and in model adaptability, and they struggle to handle complex and changing forgery scenarios accurately. The rise of multimodal models offers new insights for forgery detection, but most current methods rely on relatively simple text prompts to describe the difference between real and fake faces, overlooking the fact that the CLIP model itself carries no knowledge specific to forgery detection. We therefore propose a face forgery detection method based on multi-encoder fusion and cross-modal knowledge distillation. On the one hand, the prior knowledge of the CLIP model and of a dedicated forgery model is fused; on the other hand, alignment distillation lets the student model learn the visual anomaly patterns and semantic features of forged samples captured by the teacher model. Specifically, we extract face features by fusing the CLIP text encoder and the CLIP image encoder, and we pretrain and fine-tune the Deepfake-V2-Model on forgery detection datasets to strengthen its detection ability; together these form the teacher model. The visual and language patterns of the teacher are then aligned with the visual patterns of the pretrained student model, and the aligned representations are distilled into the student. This design combines the rich representations of the CLIP image encoder with the strong generalization ability of text embeddings, while allowing the original model to effectively acquire knowledge relevant to forgery detection. Experiments show that our method effectively improves face forgery detection performance.
{"title":"Face Forgery Detection With CLIP-Enhanced Multi-Encoder Distillation","authors":"Chunlei Peng;Tianzhe Yan;Decheng Liu;Nannan Wang;Ruimin Hu;Xinbo Gao","doi":"10.1109/TIP.2025.3644125","DOIUrl":"10.1109/TIP.2025.3644125","url":null,"abstract":"With the development of face forgery technology, fake faces are rampant, threatening the security and authenticity of many fields. Therefore, it is of great significance to study face forgery detection. At present, existing detection methods have deficiencies in the comprehensiveness of feature extraction and model adaptability, and it is difficult to accurately deal with complex and changeable forgery scenarios. However, the rise of multimodal models provides new insights for current forgery detection methods. At present, most methods use relatively simple text prompts to describe the difference between real and fake faces. However, these researchers ignore that the CLIP model itself does not have the relevant knowledge of forgery detection. Therefore, our paper proposes a face forgery detection method based on multi-encoder fusion and cross-modal knowledge distillation. On the one hand, the prior knowledge of the CLIP model and the forgery model is fused. On the other hand, through the alignment distillation, the student model can learn the visual abnormal patterns and semantic features of the forged samples captured by the teacher model. Specifically, our paper extracts the features of face photos by fusing the CLIP text encoder and the CLIP image encoder, and uses the dataset in the field of forgery detection to pretrain and fine-tune the Deepfake-V2-Model to enhance the detection ability, which are regarded as the teacher model. At the same time, the visual and language patterns of the teacher model are aligned with the visual patterns of the pretrained student model, and the aligned representations are refined to the student model. This not only combines the rich representation of the CLIP image encoder and the excellent generalization ability of text embedding, but also enables the original model to effectively acquire relevant knowledge for forgery detection. Experiments show that our method effectively improves the performance on face forgery detection.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8474-8484"},"PeriodicalIF":13.7,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-19  DOI: 10.1109/TIP.2025.3644231
Yifan Zhu;Yan Wang;Xinghui Dong
With the increasing demand for terrain visualization in many fields, such as augmented reality, virtual reality and geographic mapping, traditional terrain scene modeling methods encounter great challenges in processing efficiency, content realism and semantic consistency. To address these challenges, we propose a Text-Guided Arbitrary-Resolution Terrain Scene Generation Network (TG-TSGNet), which contains a ConvMamba-VQGAN, a Text Guidance Sub-network and an Arbitrary-Resolution Image Super-Resolution Module (ARSRM). The ConvMamba-VQGAN is built on top of the Conv-Based Local Representation Block (CLRB) and the Mamba-Based Global Representation Block (MGRB) that we design, to exploit both local and global features. Furthermore, the Text Guidance Sub-network comprises a text encoder and a Text-Image Alignment Module (TIAM) to incorporate textual semantics into the image representation. In addition, the ARSRM can be trained jointly with the ConvMamba-VQGAN to perform image super-resolution. To support the text-guided terrain scene generation task, we derive a set of textual descriptions for the 36,672 images across the 38 categories of the Natural Terrain Scene Data Set (NTSD). These descriptions can be used to train and test the TG-TSGNet (the data set, model and source code are available at https://github.com/INDTLab/TG-TSGNet). Experimental results show that the TG-TSGNet outperforms, or at least performs comparably to, the baseline methods in image realism and semantic consistency, while remaining efficient. We attribute this promising performance to the ability of the TG-TSGNet not only to capture both the local and global characteristics and the semantics of terrain scenes, but also to reduce the computational cost of image generation.
{"title":"TG-TSGNet: A Text-Guided Arbitrary-Resolution Terrain Scene Generation Network","authors":"Yifan Zhu;Yan Wang;Xinghui Dong","doi":"10.1109/TIP.2025.3644231","DOIUrl":"10.1109/TIP.2025.3644231","url":null,"abstract":"With the increasing demand for terrain visualization in many fields, such as augmented reality, virtual reality and geographic mapping, traditional terrain scene modeling methods encounter great challenges in processing efficiency, content realism and semantic consistency. To address these challenges, we propose a Text-Guided Arbitrary-Resolution Terrain Scene Generation Network (TG-TSGNet), which contains a ConvMamba-VQGAN, a Text Guidance Sub-network and an Arbitrary-Resolution Image Super-Resolution Module (ARSRM). The ConvMamba-VQGAN is built on top of the Conv-Based Local Representation Block (CLRB) and the Mamba-Based Global Representation Block (MGRB) that we design, to utilize local and global features. Furthermore, the Text Guidance Sub-network comprises a text encoder and a Text-Image Alignment Module (TIAM) for the sake of incorporating textual semantics into image representation. In addition, the ARSRM can be trained together with the ConvMamba-VQGAN, to perform the task of image super-resolution. To fulfill the text-guided terrain scene generation task, we derive a set of textual descriptions for the 36,672 images across the 38 categories of the Natural Terrain Scene Data Set (NTSD). These descriptions can be used to train and test the TG-TSGNet (The data set, model and source code are available at <uri>https://github.com/INDTLab/TG-TSGNet</uri>). Experimental results show that the TG-TSGNet outperforms, or at least performs comparably to, the baseline methods in image realism and semantic consistency with proper efficiency. We believe that the promising performance should be due to the ability of the TG-TSGNet not only to capture both the local and global characteristics and the semantics of terrain scenes, but also to reduce the computational cost of image generation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8614-8626"},"PeriodicalIF":13.7,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}