
Latest publications in IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Boosting Faithful Multi-Modal LLMs via Complementary Visual Grounding
IF 13.7 Pub Date: 2025-12-22 DOI: 10.1109/TIP.2025.3644140
Zheren Fu;Zhendong Mao;Lei Zhang;Yongdong Zhang
Multimodal Large Language Models (MLLMs) exhibit impressive performance across vision-language tasks, but still face hallucination challenges, where generated text is factually inconsistent with the visual input. Existing mitigation methods focus on surface symptoms of hallucination and rely heavily on post-hoc corrections, extensive data curation, or costly inference schemes. In this work, we identify two key factors behind MLLM hallucination: Insufficient Visual Context, where ambiguous visual contexts lead to language speculation, and Progressive Textual Drift, where model attention strays from visual inputs in longer responses. To address these problems, we propose a novel Complementary Visual Grounding (CVG) framework. CVG exploits the intrinsic architecture of MLLMs, without requiring any external tools, models, or additional data. CVG first disentangles visual context into two complementary branches based on query relevance, then maintains steadfast visual grounding during auto-regressive generation. Finally, it contrasts the output distributions of the two branches to produce a faithful response. Extensive experiments on various hallucination and general benchmarks demonstrate that CVG achieves state-of-the-art performance across MLLM architectures and scales.
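The final branch-contrasting step can be made concrete with a small sketch. The snippet below shows a generic contrastive-decoding rule that scores the next token by amplifying the visually grounded branch against its complement; the function name contrastive_decode, the weight alpha, and the greedy token choice are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def contrastive_decode(logits_grounded, logits_complement, alpha=1.0):
    # Convert branch logits to log-probabilities (unstabilised logsumexp is
    # fine for this toy vocabulary size).
    log_p = logits_grounded - np.log(np.exp(logits_grounded).sum())
    log_q = logits_complement - np.log(np.exp(logits_complement).sum())
    # Amplify evidence from the grounded branch relative to its complement.
    scores = (1.0 + alpha) * log_p - alpha * log_q
    return int(np.argmax(scores))  # greedy choice of the next token id

# Toy example with a 5-token vocabulary.
rng = np.random.default_rng(0)
print(contrastive_decode(rng.normal(size=5), rng.normal(size=5)))
```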
Citations: 0
Bayesian Multifractal Image Segmentation
IF 13.7 Pub Date: 2025-12-22 DOI: 10.1109/TIP.2025.3644793
Kareth M. León-López;Abderrahim Halimi;Jean-Yves Tourneret;Herwig Wendt
Multifractal analysis (MFA) provides a framework for the global characterization of image textures by describing the spatial fluctuations of their local regularity based on the multifractal spectrum. Several works have shown the value of MFA for describing homogeneous textures in images. Nevertheless, natural images can be composed of several textures, each with its own multifractal properties. This paper introduces an unsupervised Bayesian multifractal segmentation method that models and segments multifractal textures by jointly estimating the multifractal parameters and pixel-level labels of an image. To this end, a computationally and statistically efficient multifractal parameter estimation model for wavelet leaders is first developed, defining different multifractality parameters for different regions of an image. Then, a multiscale Potts Markov random field is introduced as a prior to model the inherent spatial and scale correlations (referred to as cross-scale correlations) between the labels of the wavelet leaders. Finally, a Gibbs sampling methodology is used to draw samples from the posterior distribution of the unknown model parameters. Numerical experiments are conducted on synthetic multifractal images to evaluate the performance of the proposed segmentation approach. The proposed method achieves superior performance compared to traditional unsupervised segmentation techniques as well as modern deep learning-based approaches, showing its effectiveness for multifractal image segmentation.
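To illustrate the label-sampling step, here is a toy single-scale Gibbs sweep for pixel labels under a Potts prior with per-pixel class log-likelihoods; the 4-neighbourhood, the coupling beta, and the function name gibbs_potts_sweep are assumptions, and the multiscale prior over wavelet-leader scales used in the paper is not reproduced.

```python
import numpy as np

def gibbs_potts_sweep(labels, loglik, beta=1.0, rng=None):
    """One Gibbs sweep over pixel labels with a Potts prior.

    labels: (H, W) int array of current class labels.
    loglik: (H, W, K) per-pixel class log-likelihoods.
    beta:   Potts coupling strength (assumed value).
    """
    rng = rng or np.random.default_rng(0)
    H, W, K = loglik.shape
    for i in range(H):
        for j in range(W):
            # Count neighbours currently assigned to each class.
            counts = np.zeros(K)
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < H and 0 <= nj < W:
                    counts[labels[ni, nj]] += 1
            logp = loglik[i, j] + beta * counts        # likelihood + Potts prior
            p = np.exp(logp - logp.max())
            labels[i, j] = rng.choice(K, p=p / p.sum())
    return labels

# Toy example: 2 classes on a 16x16 grid.
rng = np.random.default_rng(1)
loglik = rng.normal(size=(16, 16, 2))
labels = rng.integers(0, 2, size=(16, 16))
labels = gibbs_potts_sweep(labels, loglik, beta=1.0, rng=rng)
print(labels.sum())
```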
Citations: 0
SGNet: Style-Guided Network With Temporal Compensation for Unpaired Low-Light Colonoscopy Video Enhancement
IF 13.7 Pub Date: 2025-12-22 DOI: 10.1109/TIP.2025.3644172
Guanghui Yue;Lixin Zhang;Wanqing Liu;Jingfeng Du;Tianwei Zhou;Hanhe Lin;Qiuping Jiang;Wenqi Ren
Poor illumination in colonoscopy can hinder accurate disease diagnosis and adversely affect surgical procedures, so a low-light colonoscopy video enhancement method is needed. Existing low-light video enhancement methods usually apply a frame-by-frame enhancement strategy without considering the temporal correlation between frames, which often causes flickering. In addition, most methods are designed for endoscopic devices with fixed imaging styles and cannot be easily adapted to different devices. In this paper, we propose a Style-Guided Network (SGNet) for unpaired Low-Light Colonoscopy Video Enhancement (LLCVE). Given that collecting content-consistent paired videos is difficult, SGNet adopts a CycleGAN-based framework to convert low-light videos to normal-light videos, in which a Temporal Compensation (TC) module and a Style Guidance (SG) module are proposed to alleviate the flickering problem and achieve flexible style transfer, respectively. The TC module compensates for a low-light frame by learning the correlated features of its adjacent frames, thereby improving the temporal smoothness of the enhanced video. The SG module encodes the text of the imaging style and adaptively explores its intrinsic relationships with video features to obtain style representations, which are then used to guide the subsequent enhancement process. Extensive experiments on a curated database show that SGNet achieves promising performance on the LLCVE task, outperforming state-of-the-art methods in both quantitative metrics and visual quality.
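As a rough illustration of the temporal-compensation idea, the sketch below blends a frame's features with those of its neighbours using fixed cosine-similarity weights; the real TC module learns this fusion, so the function temporal_compensate and its weighting scheme are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def temporal_compensate(feat_prev, feat_cur, feat_next):
    # Per-pixel weights from the similarity between a neighbouring frame and
    # the current frame; similar neighbours contribute more to the fusion.
    def weight(neigh, cur):
        return torch.sigmoid(F.cosine_similarity(neigh, cur, dim=1)).unsqueeze(1)

    w_p, w_n = weight(feat_prev, feat_cur), weight(feat_next, feat_cur)
    return (feat_cur + w_p * feat_prev + w_n * feat_next) / (1.0 + w_p + w_n)

# Toy (batch, channel, height, width) feature maps for three adjacent frames.
cur = torch.randn(1, 8, 16, 16)
out = temporal_compensate(torch.randn_like(cur), cur, torch.randn_like(cur))
print(out.shape)  # torch.Size([1, 8, 16, 16])
```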
Citations: 0
A3-TTA: Adaptive Anchor Alignment Test-Time Adaptation for Image Segmentation
IF 13.7 Pub Date: 2025-12-22 DOI: 10.1109/TIP.2025.3644789
Jianghao Wu;Xiangde Luo;Yubo Zhou;Lianming Wu;Guotai Wang;Shaoting Zhang
Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose A3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at https://github.com/HiLab-git/A3-TTA
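The model-update stabilization can be sketched as a plain exponential moving average over model weights, as below; the paper's momentum is self-adaptive, so the fixed momentum value and the helper name ema_update here are simplifying assumptions.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    # Move each teacher weight a small step towards the student weight.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Toy segmentation head standing in for the adapted model.
student = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1)
teacher = copy.deepcopy(student)
ema_update(teacher, student, momentum=0.99)
```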
Citations: 0
PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation
IF 13.7 Pub Date: 2025-12-22 DOI: 10.1109/TIP.2025.3644785
Mengyuan Liu;Jiajie Liu;Jinyan Zhang;Wenhao Li;Junsong Yuan
Lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. Lifting-based methods encode the detected 2D pose and the unknown depth in an entangled feature space, explicitly introducing depth uncertainty into the detected 2D pose and thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is first refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: 1) a mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This design disentangles the feature encoding of 2D pose and depth, thereby reducing the explicit influence of uncertain depth features on 2D pose features. 2) A cross-expert knowledge aggregation module that aggregates cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that the proposed PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
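The expert-routing idea can be illustrated with a tiny two-expert mixture whose gate softly weights a 2D-pose expert and a depth expert; this toy module (TinyMoE) omits the cross-expert aggregation described above, and its layer sizes are assumed for illustration.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Two-expert mixture with a learned soft gate (illustrative only)."""

    def __init__(self, dim=64):
        super().__init__()
        self.expert_pose = nn.Linear(dim, dim)   # refines 2D-pose features
        self.expert_depth = nn.Linear(dim, dim)  # learns depth features
        self.gate = nn.Linear(dim, 2)            # routing weights per token

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)  # (N, 2) soft routing weights
        return w[..., :1] * self.expert_pose(x) + w[..., 1:] * self.expert_depth(x)

moe = TinyMoE()
print(moe(torch.randn(17, 64)).shape)  # 17 joint tokens, 64-dim features
```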
Citations: 0
Degradation-Aware Prompted Transformer for Unified Medical Image Restoration
IF 13.7 Pub Date: 2025-12-22 DOI: 10.1109/TIP.2025.3644795
Jinbao Wei;Gang Yang;Zhijie Wang;Shimin Tao;Aiping Liu;Xun Chen
Medical image restoration (MedIR) aims to recover high-quality images from degraded inputs, yet faces unique challenges from physics-driven degradations and multi-modal task interference. While existing all-in-one methods handle natural image degradations well, they struggle with medical scenarios due to limited degradation perception and suboptimal multi-task optimization. In response, we introduce DaPT, a Degradation-aware Prompted Transformer, which integrates dynamic prompt learning and modular expert mining for unified MedIR. First, DaPT introduces spatially compact prompts with optimal transport regularization, amplifying inter-prompt differences to capture diverse degradation patterns. Second, a mixture of experts dynamically routes inputs to specialized modules via prompt guidance, resolving task conflicts while reducing computational overhead. The synergy of prompt learning and expert mining further enables robust restoration across multi-modal medical data, offering a practical solution for clinical imaging. Extensive experiments across multiple modalities (MRI, CT, PET) and diverse degradations, covering both in-distribution and out-of-distribution scenarios, demonstrate that DaPT consistently outperforms state-of-the-art methods and generalizes reliably to unseen settings, underscoring its robustness, effectiveness, and clinical practicality. The source code will be released at https://github.com/weijinbao1998/DaPT
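A minimal sketch of prompt-guided routing is given below: an input's degradation feature is soft-assigned to a bank of prompts by cosine similarity and the prompts are mixed accordingly; the temperature, feature sizes, and function name route_by_prompt are assumptions, and the optimal-transport regularization of the prompts is omitted.

```python
import torch
import torch.nn.functional as F

def route_by_prompt(feat, prompts, temperature=0.07):
    # Similarity of each input feature to each degradation prompt: (B, P).
    sim = F.cosine_similarity(feat.unsqueeze(1), prompts.unsqueeze(0), dim=-1)
    weights = torch.softmax(sim / temperature, dim=-1)  # soft routing weights
    return weights @ prompts                            # (B, D) prompt mixture

feat = torch.randn(4, 128)     # degradation features for 4 images
prompts = torch.randn(6, 128)  # a bank of 6 learnable degradation prompts
print(route_by_prompt(feat, prompts).shape)  # torch.Size([4, 128])
```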
Citations: 0
Cross-Frequency Attention and Color Contrast Constraint for Remote Sensing Dehazing
IF 13.7 Pub Date: 2025-12-22 DOI: 10.1109/TIP.2025.3644167
Yuxin Feng;Jufeng Li;Tao Huang;Fangfang Wu;Yakun Ju;Chunxu Li;Weisheng Dong;Alex C. Kot
Current deep learning-based methods for remote sensing image dehazing have developed rapidly, yet they still commonly struggle to simultaneously preserve fine texture details and restore accurate colors. The fundamental reason lies in insufficient modeling of the high-frequency information that captures structural details, as well as the lack of effective constraints for color restoration. To address the insufficient modeling of global high-frequency information, we first develop an omni-directional high-frequency feature inpainting mechanism that leverages the wavelet transform to extract multi-directional high-frequency components. While maintaining the advantage of linear complexity, it models global long-range texture dependencies through cross-frequency perception. Then, to further strengthen local high-frequency representation, we design a high-frequency prompt attention module that dynamically injects wavelet-domain optimized high-frequency features as cross-level guidance signals, significantly enhancing the model's capability in edge sharpness restoration and texture detail reconstruction. Further, to alleviate the problem of inaccurate color restoration, we propose a color contrast loss function based on the HSV color space, which explicitly models the statistical distribution differences of brightness and saturation in hazy regions, guiding the model to generate dehazed images with consistent colors and a natural visual appearance. Finally, extensive experiments on multiple benchmark datasets demonstrate that the proposed method outperforms existing approaches in both texture detail restoration and color consistency. Further results and code are available at: https://github.com/fyxnl/C4RSD
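The multi-directional high-frequency extraction can be illustrated with a single-level 2-D discrete wavelet transform, assuming the PyWavelets package and a Haar basis (the paper does not commit to this particular wavelet); the horizontal, vertical and diagonal detail bands play the role of the directional high-frequency components.

```python
import numpy as np
import pywt

def directional_highfreq(channel):
    # One-level 2-D DWT: approximation plus horizontal/vertical/diagonal details.
    _, (cH, cV, cD) = pywt.dwt2(channel, "haar")
    return cH, cV, cD

channel = np.random.rand(64, 64)  # stand-in for one channel of a hazy image
for name, band in zip(("horizontal", "vertical", "diagonal"),
                      directional_highfreq(channel)):
    print(name, band.shape)  # each detail band is 32x32
```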
Citations: 0
Meta-TIP: An Unsupervised End-to-End Fusion Network for Multi-Dataset Style-Adaptive Threat Image Projection
IF 13.7 Pub Date: 2025-12-19 DOI: 10.1109/TIP.2025.3609135
Bowen Ma;Tong Jia;Hao Wang;Dongyue Chen
Threat Image Projection (TIP) is a convenient and effective means of expanding X-ray baggage image sets, which is essential for training both security personnel and computer-aided screening systems. Existing methods fall primarily into two categories: X-ray imaging principle-based methods and GAN-based generative methods. The former treat prohibited-item acquisition and projection as two separate steps and rarely consider the style consistency between the source prohibited items and target X-ray images from different datasets, making them less flexible and reliable in practical applications. Although GAN-based methods can directly generate visually consistent prohibited items on target images, they suffer from unstable training and a lack of interpretability, which significantly impact the quality of the generated items. To overcome these limitations, we present a conceptually simple, flexible, and unsupervised end-to-end TIP framework, termed Meta-TIP, which superimposes the prohibited item distilled from the source image onto the target image in a style-adaptive manner. Specifically, Meta-TIP applies three main innovations: 1) a pure prohibited item is reconstructed from a cluttered source image with a novel foreground-background contrastive loss; 2) a material-aware, style-adaptive projection module learns two modulation parameters based on the style of similar-material objects in the target image to control the appearance of prohibited items; and 3) a novel logarithmic loss, designed around the TIP imaging principle, optimizes the synthetic results in an unsupervised manner. We comprehensively verify the authenticity and training effect of the synthetic X-ray images on four public datasets, i.e., SIXray, OPIXray, PIXray, and PIDray, and the results confirm that our framework can flexibly generate highly realistic synthetic images without any limitations.
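For intuition, the projection step of principle-based TIP can be approximated by multiplying transmittances in the overlap region, in the spirit of Beer-Lambert attenuation; the helper project_item and the normalised-image assumption are illustrative, and none of Meta-TIP's learned, style-adaptive components are modelled.

```python
import numpy as np

def project_item(target, item_transmittance, mask):
    # X-ray intensity through overlapping objects is roughly the product of
    # their transmittances, so composite by multiplying inside the item mask.
    out = target.copy()
    out[mask] = target[mask] * item_transmittance[mask]
    return np.clip(out, 0.0, 1.0)

target = np.random.rand(128, 128)                      # normalised target X-ray
item = np.clip(np.random.rand(128, 128), 0.2, 1.0)     # item transmittance map
mask = np.zeros((128, 128), dtype=bool)
mask[40:90, 50:100] = True                             # placement region
print(project_item(target, item, mask).mean())
```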
Citations: 0
Face Forgery Detection With CLIP-Enhanced Multi-Encoder Distillation
IF 13.7 Pub Date: 2025-12-19 DOI: 10.1109/TIP.2025.3644125
Chunlei Peng;Tianzhe Yan;Decheng Liu;Nannan Wang;Ruimin Hu;Xinbo Gao
With the development of face forgery technology, fake faces are rampant, threatening security and authenticity in many fields, so studying face forgery detection is of great significance. Existing detection methods fall short in the comprehensiveness of feature extraction and in model adaptability, making it difficult to handle complex and variable forgery scenarios accurately. The rise of multimodal models, however, offers new insights for forgery detection. Most current methods use relatively simple text prompts to describe the difference between real and fake faces, but they overlook the fact that the CLIP model itself lacks knowledge relevant to forgery detection. Therefore, this paper proposes a face forgery detection method based on multi-encoder fusion and cross-modal knowledge distillation. On the one hand, the prior knowledge of the CLIP model and the forgery model is fused. On the other hand, through alignment distillation, the student model learns the visual abnormal patterns and semantic features of forged samples captured by the teacher model. Specifically, we extract features of face photos by fusing the CLIP text encoder and the CLIP image encoder, and use forgery-detection datasets to pretrain and fine-tune the Deepfake-V2-Model to enhance its detection ability; together these serve as the teacher model. At the same time, the visual and language patterns of the teacher model are aligned with the visual patterns of the pretrained student model, and the aligned representations are distilled into the student model. This not only combines the rich representations of the CLIP image encoder with the strong generalization ability of text embeddings, but also enables the original model to effectively acquire knowledge relevant to forgery detection. Experiments show that our method effectively improves face forgery detection performance.
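The alignment-distillation objective can be sketched as a cosine alignment between student features and detached teacher features, as below; the exact loss used in the paper may differ, and the function name alignment_distill_loss and the feature sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_distill_loss(student_feat, teacher_feat):
    # Pull student representations towards the frozen teacher representations.
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat.detach(), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()  # mean cosine distance

loss = alignment_distill_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```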
Citations: 0
TG-TSGNet: A Text-Guided Arbitrary-Resolution Terrain Scene Generation Network
IF 13.7 Pub Date: 2025-12-19 DOI: 10.1109/TIP.2025.3644231
Yifan Zhu;Yan Wang;Xinghui Dong
With the increasing demand for terrain visualization in many fields, such as augmented reality, virtual reality and geographic mapping, traditional terrain scene modeling methods encounter great challenges in processing efficiency, content realism and semantic consistency. To address these challenges, we propose a Text-Guided Arbitrary-Resolution Terrain Scene Generation Network (TG-TSGNet), which contains a ConvMamba-VQGAN, a Text Guidance Sub-network and an Arbitrary-Resolution Image Super-Resolution Module (ARSRM). The ConvMamba-VQGAN is built on top of our Conv-Based Local Representation Block (CLRB) and Mamba-Based Global Representation Block (MGRB), which exploit local and global features. Furthermore, the Text Guidance Sub-network comprises a text encoder and a Text-Image Alignment Module (TIAM) that incorporates textual semantics into the image representation. In addition, the ARSRM can be trained together with the ConvMamba-VQGAN to perform image super-resolution. To support the text-guided terrain scene generation task, we derive a set of textual descriptions for the 36,672 images across the 38 categories of the Natural Terrain Scene Data Set (NTSD). These descriptions can be used to train and test the TG-TSGNet (the data set, model and source code are available at https://github.com/INDTLab/TG-TSGNet). Experimental results show that the TG-TSGNet outperforms, or at least performs comparably to, the baseline methods in image realism and semantic consistency, with reasonable efficiency. We believe this promising performance is due to the ability of the TG-TSGNet not only to capture both the local and global characteristics and the semantics of terrain scenes, but also to reduce the computational cost of image generation.
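Text-image alignment of the kind performed by the TIAM is often trained with a symmetric contrastive objective; the sketch below shows such a loss over paired text and image embeddings, with the temperature and embedding size as assumed values rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def text_image_alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE: matching text/image pairs sit on the diagonal.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(text_image_alignment_loss(torch.randn(4, 256), torch.randn(4, 256)).item())
```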
Citations: 0