We propose a novel Greedy Graph Cut (GGC) algorithm to address the graph partitioning problem. The algorithm begins by treating each data point as an individual cluster and iteratively merges cluster pairs that maximize the reduction in the global objective function until the desired number of clusters is achieved. We provide a theoretical proof of the monotonic convergence of the objective function values throughout this process. To improve computational efficiency, the algorithm restricts merging operations to adjacent clusters, resulting in a computational complexity that scales nearly linearly with the sample size. A significant advantage of our greedy approach is its deterministic nature, which ensures consistent results across multiple runs. This stands in contrast to many existing algorithms that are sensitive to random initialization effects. We demonstrate the effectiveness of the proposed algorithm by applying it to the Normalized Cut (N-Cut) problem, a well-studied variant of graph partitioning. Extensive experimental results show that GGC consistently outperforms the conventional two-stage optimization approach—which involves eigendecomposition followed by k-means clustering—in solving the N-Cut problem. Furthermore, comparative analyses reveal that GGC achieves superior performance compared to several state-of-the-art clustering algorithms.
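The merge loop the abstract describes can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not the authors' implementation: `ncut_value` is a plain N-Cut objective, the adjacency test simply checks for a nonzero connecting edge weight, and the greedy search is brute force rather than the paper's near-linear scheme.

```python
import numpy as np

def ncut_value(W, labels):
    """N-Cut objective: sum over clusters of cut(A, complement) / assoc(A, V)."""
    total = 0.0
    for c in np.unique(labels):
        mask = labels == c
        assoc = W[mask].sum()                # total edge weight incident to the cluster
        cut = W[np.ix_(mask, ~mask)].sum()   # weight crossing the cluster boundary
        if assoc > 0:
            total += cut / assoc
    return total

def greedy_graph_cut(W, k):
    """Start from singletons; greedily merge the adjacent pair that most
    reduces the objective, until k clusters remain. Fully deterministic."""
    labels = np.arange(len(W))
    while len(np.unique(labels)) > k:
        best_val, best_pair = None, None
        ids = np.unique(labels)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # adjacency restriction: skip cluster pairs with no connecting edge
                if W[np.ix_(labels == a, labels == b)].sum() == 0:
                    continue
                trial = labels.copy()
                trial[trial == b] = a
                val = ncut_value(W, trial)
                if best_val is None or val < best_val:
                    best_val, best_pair = val, (a, b)
        a, b = best_pair
        labels[labels == b] = a
    return labels

# two cliques joined by one weak edge: the cliques should be recovered
W = np.zeros((6, 6))
for group in [(0, 1, 2), (3, 4, 5)]:
    for i in group:
        for j in group:
            if i != j:
                W[i, j] = 1.0
W[2, 3] = W[3, 2] = 0.1
print(greedy_graph_cut(W, 2))
```

Because every tie is broken by fixed iteration order, repeated runs on the same graph give identical partitions, which mirrors the determinism claim above.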
Title: A Greedy Strategy for Graph Cut
Authors: Shenfei Pei; Huijuan Dong; Nianci Guan; Zhongqi Lin; Feiping Nie; Xudong Jiang; Zengwei Zheng
IEEE Transactions on Image Processing, vol. 35, pp. 2224-2234. Pub Date: 2026-02-11. DOI: 10.1109/TIP.2026.3661874
Pub Date: 2026-02-11 | DOI: 10.1109/TIP.2026.3661813
Yihang Xu;Qiulei Dong
Self-supervised monocular depth estimation for fisheye cameras has attracted much attention in recent years due to the large view range of such cameras. However, the performance of existing methods in this field is generally limited by the inevitable severe distortions in fisheye images. To address this problem, we propose a distortion-aware depth self-updating network for self-supervised fisheye monocular depth estimation, called DDS-Net. DDS-Net employs a coarse-to-fine learning strategy, in which a fine depth predictor that predicts the final depth is optimized with scene depths predicted by a pretrained coarse depth predictor. The fine depth predictor contains a distortion-aware fisheye cost volume construction module and a depth self-updating module. The distortion-aware fisheye cost volume construction module learns the feature matching cost between consecutive fisheye frames, enabling more accurate pixel-level depth cues to be captured under severe distortions. Based on the constructed cost volume and the initial depth estimated by the pretrained coarse depth predictor, the depth self-updating module iteratively refines the depth map. Extensive experimental results on three fisheye datasets demonstrate that the proposed method significantly outperforms 14 state-of-the-art methods for fisheye monocular depth estimation.
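The cost-volume idea underlying the module above can be illustrated on 1D signals: stack the matching cost for every candidate displacement, then pick the hypothesis with the lowest aggregated cost. This is a generic plane-sweep sketch, not DDS-Net's fisheye-specific construction; all names here are illustrative.

```python
import numpy as np

def build_cost_volume(left, right, max_disp):
    """cost[d, x] = |left[x] - right[x - d]|; positions with no match get +inf."""
    n = len(left)
    cost = np.full((max_disp + 1, n), np.inf)
    for d in range(max_disp + 1):
        cost[d, d:] = np.abs(left[d:] - right[:n - d])
    return cost

rng = np.random.default_rng(0)
right = rng.random(64)
true_disp = 3
left = np.roll(right, true_disp)            # second view shifted by a known amount
cost = build_cost_volume(left, right, max_disp=8)
# aggregate the matching cost over valid positions and pick the best hypothesis
est = int(cost[:, 8:].mean(axis=1).argmin())
print(est)
```

In the actual networks the per-pixel cost is computed between learned features rather than raw intensities, and the argmin is replaced by learned, iterative refinement.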
Title: Distortion-Aware Depth Self-Updating for Self-Supervised Fisheye Monocular Depth Estimation
IEEE Transactions on Image Processing, vol. 35, pp. 1883-1898.
Deep unfolding networks (DUNs), which combine conventional iterative optimization algorithms and deep neural networks into a multi-stage framework, have achieved remarkable results in Image Restoration (IR) tasks such as spectral imaging reconstruction, compressive sensing, and super-resolution. Such a network unfolds the iterative optimization steps into a stack of sequentially linked blocks. Each block consists of a Gradient Descent Module (GDM) and a Proximal Mapping Module (PMM), which is equivalent, from a Bayesian perspective, to a denoiser operating on Gaussian noise with a known level. However, existing DUNs suffer from two critical limitations: 1) their PMMs share identical architectures and denoising objectives across stages, ignoring the need for stage-specific adaptation to varying noise levels; and 2) their chain of structurally repetitive blocks results in severe parameter redundancy and high memory consumption, hindering deployment in large-scale or resource-constrained scenarios. To address these challenges, we introduce generalized Deep Low-rank Adaptation (LoRA) Unfolding Networks for image restoration, named LoRun, which harmonize denoising objectives and adapt to different denoising levels between stages with compressed memory usage, yielding a more efficient DUN. LoRun introduces a novel paradigm in which a single pretrained base denoiser is shared across all stages, while lightweight, stage-specific LoRA adapters are injected into the PMMs to dynamically modulate denoising behavior according to the noise level at each unfolding step. This design decouples the core restoration capability from task-specific adaptation, enabling precise control over denoising intensity without duplicating full network parameters and achieving up to $N$ times parameter reduction for an $N$-stage DUN with on-par or better performance. Extensive experiments conducted on three IR tasks validate the efficiency of our method.
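The parameter-sharing arithmetic behind the "up to $N$ times reduction" claim is easy to see with a single shared weight matrix plus per-stage low-rank updates. This is a minimal sketch, assuming a linear stand-in for the denoiser; `W_base`, `adapters`, and `stage_denoise` are illustrative names, not LoRun's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_stages = 64, 4, 5            # feature dim, LoRA rank, unfolding stages

W_base = rng.standard_normal((d, d)) / np.sqrt(d)   # shared frozen base weight

# one lightweight (A, B) adapter pair per stage: W_stage = W_base + B @ A
adapters = [(rng.standard_normal((r, d)) * 0.01,    # A: d -> r
             rng.standard_normal((d, r)) * 0.01)    # B: r -> d
            for _ in range(n_stages)]

def stage_denoise(x, stage):
    """Apply the shared base weight plus the stage-specific low-rank update."""
    A, B = adapters[stage]
    return W_base @ x + B @ (A @ x)

x = rng.standard_normal(d)
outs = [stage_denoise(x, s) for s in range(n_stages)]

# parameter count: N full matrices vs. one shared matrix plus N adapter pairs
full = n_stages * d * d
lora = d * d + n_stages * 2 * r * d
print(f"full: {full}, shared+LoRA: {lora}, reduction ~{full / lora:.1f}x")
```

As the rank r shrinks relative to d, the adapter cost becomes negligible and the reduction approaches the N-fold bound stated in the abstract.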
Title: Deep LoRA-Unfolding Networks for Image Restoration
Authors: Xiangming Wang; Haijin Zeng; Benteng Sun; Jiezhang Cao; Kai Zhang; Qiangqiang Shen; Yongyong Chen
IEEE Transactions on Image Processing, vol. 35, pp. 1858-1869. Pub Date: 2026-02-10. DOI: 10.1109/TIP.2026.3661406
Pub Date: 2026-02-10 | DOI: 10.1109/TIP.2026.3661408
Yi Ke Yun, Weisi Lin
Existing Image Quality Assessment (IQA) models are limited to either full-reference or no-reference evaluation tasks, while humans can seamlessly switch between these assessment types. This motivates us to explore resolving both tasks with a single versatile model. In this work, we propose a novel framework that unifies full-reference and no-reference IQA. Our approach utilizes an encoder to extract multi-level features from images and introduces a Hierarchical Attention module to adaptively handle spatial distortions for both full-reference and no-reference inputs. Additionally, we develop a Semantic Distortion Aware module to analyze feature correlations between shallow and deep layers of the encoder, thereby accounting for the varying effects of different distortions on these layers. Our proposed framework achieves state-of-the-art performance for both full-reference and no-reference IQA tasks when trained separately. Furthermore, when the model is trained jointly on both types of tasks, it not only enhances performance in no-reference IQA but also maintains competitive results in full-reference IQA. This integrated approach facilitates a single training process that efficiently addresses both IQA tasks, representing a significant advancement in model versatility and performance.
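The unifying idea, one model whose forward pass accepts an optional reference, can be sketched with a toy scorer. Everything here (the statistics-based "encoder", the scoring rules) is a hypothetical stand-in used only to show the shared-interface pattern, not the paper's architecture.

```python
import numpy as np

def extract_features(img):
    """Stand-in multi-level encoder: per-level mean/std statistics."""
    levels = [img, img[::2, ::2], img[::4, ::4]]      # crude multi-scale pyramid
    return np.array([[lvl.mean(), lvl.std()] for lvl in levels]).ravel()

def predict_quality(distorted, reference=None):
    """One model, two modes: compare against reference features when a
    reference is given (FR); fall back to the distorted image's own
    statistics otherwise (NR). Higher score = better quality."""
    f_d = extract_features(distorted)
    if reference is None:
        return -np.abs(f_d).sum()            # NR branch: toy statistic score
    f_r = extract_features(reference)
    return -np.abs(f_r - f_d).sum()          # FR branch: feature discrepancy

rng = np.random.default_rng(0)
ref = rng.random((32, 32))
dist = ref + 0.2 * rng.standard_normal((32, 32))
print(predict_quality(dist, ref))   # full-reference score
print(predict_quality(dist))        # no-reference score
```

The benefit of this shape is exactly the one the abstract claims: a single set of weights serves both query types, so joint training on FR and NR data needs no architectural switch.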
Title: You Only Train Once: A Unified Framework for Both Full-Reference and No-Reference Image Quality Assessment
IEEE Transactions on Image Processing, vol. PP.
Pub Date: 2026-02-10 | DOI: 10.1109/TIP.2026.3661417
Yang Yang;Huibin Luo;Haotian Wang;Jingchi Jiang;Jie Liu;Jian Wei;Ming Fang
Reasoning segmentation (RS) interprets implicit textual instructions to accurately segment target regions. This reasoning capability transforms ambiguous non-expert queries into precise pixel-level masks, thereby enabling downstream tasks like area measurement and density analysis with a level of precision unattainable by detection methods. However, existing RS models are not tailored for agriculture and lack domain-specific knowledge, which poses challenges in handling similar pest appearances and small target scales. To bridge this gap, we introduce a fine-grained pest RS task with two subtasks: Pest Discriminative Referring Expression Segmentation (PDRES) and Pest Exclusion Reasoning Segmentation (PERS). Based on this, we propose PestScope, which integrates vision, language, and reasoning for fine-grained pest segmentation. To tackle the exclusion of small non-target pests, we introduce a dedicated [NON] token alongside the standard [SEG] token for target pests. This guides the model to prioritize small target pests and suppress non-target background regions. To further address pest similarity, we propose an Exclusivity Suppression Loss, applying differentiated supervision to [SEG] and [NON] tokens to better separate target and non-target pests. Additionally, we develop an automated dataset construction pipeline to address the scarcity of fine-grained, difficulty-controllable pest RS datasets. It produces 45k and 27.6k image-text-mask samples for the PDRES and PERS tasks, respectively, covering 18 pest categories. Experiments show that in small and similar pest scenarios, integrating PestScope into mainstream models improves average gIoU by 4.28% on PDRES and 6.49% on PERS. For unseen pest categories, gIoU increases by 21.72% and 8.66%, respectively, demonstrating strong generalization. Code and datasets will be available at: https://github.com/aluodaydayup/PestScope
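The differentiated supervision of the [SEG] and [NON] tokens can be sketched as two complementary mask losses plus an overlap penalty. This is a toy numpy formulation of the idea only; the paper's actual Exclusivity Suppression Loss is defined on token embeddings inside the model, and `exclusivity_suppression` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def exclusivity_suppression(seg_logits, non_logits, target_mask, eps=1e-8):
    """Differentiated supervision: push the [SEG] map toward target pests,
    the [NON] map toward everything else, and penalize pixels both maps claim."""
    seg, non = sigmoid(seg_logits), sigmoid(non_logits)
    bce_seg = -(target_mask * np.log(seg + eps)
                + (1 - target_mask) * np.log(1 - seg + eps)).mean()
    bce_non = -((1 - target_mask) * np.log(non + eps)
                + target_mask * np.log(1 - non + eps)).mean()
    overlap = (seg * non).mean()     # exclusivity term: the two maps must disagree
    return bce_seg + bce_non + overlap

target = np.array([[1.0, 0.0], [0.0, 0.0]])   # one small target-pest pixel
good_seg = 10.0 * (2.0 * target - 1.0)        # confident, correct logits
print(exclusivity_suppression(good_seg, -good_seg, target))
print(exclusivity_suppression(-good_seg, good_seg, target))  # roles swapped
```

The overlap term is what encodes "exclusion": a pixel confidently claimed as both target and non-target is penalized even when each individual mask loss is low.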
Title: PestScope: Exclusion-Aware Large Multimodal Model for Fine-Grained Agricultural Pest Segmentation
IEEE Transactions on Image Processing, vol. 35, pp. 2034-2049.
Visual Language Tracking (VLT) enables machines to perform tracking in the real world through human-like language descriptions. However, existing VLT methods are limited to 2D spatial tracking or single-object 3D tracking and do not support multi-object 3D tracking within monocular video. This limitation arises because advancements in 3D multi-object tracking have predominantly relied on sensor-based data (e.g., point clouds, depth sensors) that lacks corresponding language descriptions. Moreover, natural language descriptions in the existing VLT literature often suffer from redundancy, impeding the efficient and precise localization of multiple objects. We present the first technique to extend VLT to multi-object 3D tracking using monocular video. We introduce a comprehensive framework that includes (i) a Monocular Multi-object 3D Visual Language Tracking (MoMo-3DVLT) task, (ii) a large-scale dataset, MoMo-3DRoVLT, tailored for this task, and (iii) a custom neural model. Our dataset, generated with the aid of Large Language Models (LLMs) and manual verification, contains 8,216 video sequences annotated with both 2D and 3D bounding boxes, with each sequence accompanied by three freely generated, human-level textual descriptions. We propose MoMo-3DVLTracker, the first neural model specifically designed for MoMo-3DVLT. This model integrates a multimodal feature extractor, a visual language encoder-decoder, and modules for detection and tracking, setting a strong baseline for MoMo-3DVLT. Beyond existing paradigms, it introduces a task-specific structural coupling that integrates a differentiable linked-memory mechanism with depth-guided and language-conditioned reasoning for robust monocular 3D multi-object tracking. Experimental results demonstrate that our approach outperforms existing methods on the MoMo-3DRoVLT dataset. Our dataset and code are available at https://github.com/hongkai-wei/MoMo-3DVLT.
Title: Monocular Multi-Object 3D Visual Language Tracking
Authors: Hongkai Wei; Rong Wang; Haixiang Hu; Shijie Sun; Xiangyu Song; Mingtao Feng; Keyu Guo; Yongle Huang; Hua Cui; Naveed Akhtar
IEEE Transactions on Image Processing, vol. 35, pp. 2050-2065. Pub Date: 2026-02-10. DOI: 10.1109/TIP.2026.3661407
Pub Date: 2026-02-09 | DOI: 10.1109/TIP.2026.3659746
Mengzu Liu;Junwei Xu;Weisheng Dong;Le Dong;Guangming Shi
Image reconstruction in coded aperture snapshot spectral compressive imaging (CASSI) aims to recover high-fidelity hyperspectral images (HSIs) from compressed 2D measurements. While deep unfolding networks have shown promising performance, the degradation induced by the CASSI degradation model often introduces global illumination discrepancies in the reconstructions, creating artifacts similar to those in low-light images. To address these challenges, we propose a novel Retinex Prior-Driven Unfolding Network (RPDUN), which unfolds the optimization incorporating the Retinex prior as a regularization term into a multi-stage network. This design provides global illumination adjustment for compressed measurements, effectively compensating for spatial-spectral degradation according to physical modulation and capturing intrinsic spectral characteristics. To the best of our knowledge, this is the first application of the Retinex prior in hyperspectral image reconstruction. Furthermore, to mitigate the noise in the reflectance domain, which can be amplified during decomposition, we introduce an Adaptive Token Selection Transformer (ATST). This module adaptively filters out weakly correlated tokens before the self-attention computation, effectively reducing noise and artifacts within the recovered reflectance map. Extensive experiments on both simulated and real-world datasets demonstrate that RPDUN achieves new state-of-the-art performance, significantly improving reconstruction quality while maintaining computational efficiency. The code is available at https://github.com/ZUGE0312/RPDUN
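The Retinex prior referenced above factors an image into reflectance and illumination, I = R ∘ L, so that a global illumination adjustment acts only on L. A minimal sketch of that decomposition, assuming illumination is a local mean (RPDUN's prior is learned, not this heuristic, and `retinex_decompose` is an illustrative name):

```python
import numpy as np

def retinex_decompose(img, ksize=7):
    """Toy Retinex split: illumination as a local mean, reflectance as the ratio."""
    pad = ksize // 2
    padded = np.pad(img, pad, mode="edge")
    illum = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            illum[i, j] = padded[i:i + ksize, j:j + ksize].mean()
    reflect = img / np.maximum(illum, 1e-6)
    return reflect, illum

rng = np.random.default_rng(0)
scene = rng.random((16, 16))
gradient = np.linspace(0.2, 1.0, 16)[None, :]      # synthetic illumination falloff
observed = scene * gradient
R, L = retinex_decompose(observed)
# reconstruction is exact by construction: observed == R * L
print(np.allclose(R * L, observed))
```

The abstract's point is that noise concentrates in the reflectance map R after such a split, which is why the token-selecting Transformer operates in that domain.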
Title: Learning Retinex Prior for Compressive Hyperspectral Image Reconstruction
IEEE Transactions on Image Processing, vol. 35, pp. 1786-1801.
Pub Date: 2026-02-09 | DOI: 10.1109/TIP.2026.3660576
Zhiwen Yang;Yuxin Peng
Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for perception-prediction-planning autonomous driving systems. Although significant progress has been made in existing methods, their optimization relies solely on supervision from voxel labels and faces the challenge of voxel sparsity, as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a Multi-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits scene- and instance-level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that our MRA approach significantly outperforms existing state-of-the-art methods, showcasing its effectiveness in mitigating the impact of sparse voxel labels. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP.
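A voxel's "semantic difference against its cubic neighborhood" can be made concrete as the fraction of neighbors carrying a different label. This toy score, with the hypothetical name `cubic_semantic_anisotropy`, only illustrates the neighborhood-comparison idea; the paper's module computes it on learned features, not hard labels.

```python
import numpy as np

def cubic_semantic_anisotropy(labels, radius=1):
    """For each voxel, the fraction of neighbors in a cubic window that carry
    a different semantic label (a toy 'boundary-ness' score in [0, 1])."""
    d, h, w = labels.shape
    score = np.zeros(labels.shape)
    for z in range(d):
        for y in range(h):
            for x in range(w):
                z0, z1 = max(z - radius, 0), min(z + radius + 1, d)
                y0, y1 = max(y - radius, 0), min(y + radius + 1, h)
                x0, x1 = max(x - radius, 0), min(x + radius + 1, w)
                cube = labels[z0:z1, y0:y1, x0:x1]
                diff = (cube != labels[z, y, x]).sum()
                score[z, y, x] = diff / (cube.size - 1)
    return score

grid = np.zeros((4, 4, 4), dtype=int)
grid[:, :, 2:] = 1                     # two semantic regions split along x
aniso = cubic_semantic_anisotropy(grid)
# voxels far from the boundary score 0; voxels next to it score > 0
print(aniso[:, :, 0].max(), aniso[:, :, 1].max())
```

Voxels with high anisotropy sit on semantic boundaries, which is exactly where the Critical Distribution Alignment module places its instance-level anchors.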
{"title":"Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion","authors":"Zhiwen Yang;Yuxin Peng","doi":"10.1109/TIP.2026.3660576","DOIUrl":"10.1109/TIP.2026.3660576","url":null,"abstract":"Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for the perception-prediction-planning autonomous driving systems. Although significant progress has been made in existing methods, their optimization rely solely on the supervision from voxel labels and face the challenge of voxel sparsity as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a Multi-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits the scene and instance level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. 
Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that our MRA approach significantly outperforms existing state-of-the-art methods, showcasing its effectiveness in mitigating the impact of sparse voxel labels. The code is available at <uri>https://github.com/PKU-ICST-MIPL/MRA_TIP.</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1771-1785"},"PeriodicalIF":13.7,"publicationDate":"2026-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146151576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
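The Cubic Semantic Anisotropy idea above — scoring each voxel by how much its semantic label differs from its cubic neighborhood, so boundary voxels can serve as instance-level anchors — can be illustrated with a toy numpy sketch. The function name, the neighborhood size `k`, and the fraction-of-differing-neighbors score are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def cubic_semantic_anisotropy(labels, k=3):
    """Toy sketch: for each voxel, the fraction of voxels in its k*k*k cubic
    neighborhood carrying a different semantic label. High values mark voxels
    near semantic boundaries; interior voxels of a homogeneous region score 0.
    Illustrative only -- not the paper's exact scoring."""
    assert k % 2 == 1, "neighborhood size must be odd"
    r = k // 2
    padded = np.pad(labels, r, mode="edge")  # replicate border labels
    depth, height, width = labels.shape
    aniso = np.zeros(labels.shape, dtype=float)
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                cube = padded[z:z + k, y:y + k, x:x + k]
                aniso[z, y, x] = np.mean(cube != labels[z, y, x])
    return aniso
```

On a fully homogeneous grid every score is 0, while voxels adjacent to a label change score above 0, which is the property the Critical Distribution Alignment module would exploit when picking critical voxels.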
Pub Date : 2026-02-09DOI: 10.1109/TIP.2026.3660575
Xinyi Wu;Cuiqun Chen;Hui Zeng;Zhiping Cai;Mang Ye
Sketch-based Person Retrieval (SBPR) aims to identify and retrieve a target individual across non-overlapping camera views using professional sketches as queries. In practice, sketches drawn by different artists often exhibit unpredictably diverse painting styles. The substantial style variations among sketches pose significant challenges to the stability and generalizability of SBPR models. Prior works attempt to mitigate style variations through style manipulation methods, which inevitably undermine the inherent structural relations among multiple sketch features. This leads to overfitting to existing training styles and poor generalization to new, unseen sketch styles. In this paper, we introduce FreeStyle, an innovative style-inclusive framework for SBPR, built upon the foundational CLIP architecture. FreeStyle explicitly models the relations across diverse sketch styles via style consistency enhancement, enabling dynamic adaptation to both seen and unseen style variations. Specifically, Diverse Style Semantic Unification is first devised to enhance the style consistency of each identity at the semantic level by introducing objective attribute-level semantic constraints. Meanwhile, Diverse Style Feature Squeezing tackles unclear feature boundaries among identities by concentrating the intra-identity space and separating the inter-identity space, thereby strengthening style consistency at the feature representation level. Additionally, considering the feature distribution discrepancy between sketches and photos, an identity-centric cross-modal prototype alignment mechanism is introduced to facilitate identity-aware cross-modal associations and promote a compact joint embedding space. Extensive experiments validate that FreeStyle not only achieves stable performance under seen style variations but also demonstrates strong generalization to unseen sketch styles.
{"title":"FreeStyle: Toward Style-Inclusive Sketch-Based Person Retrieval","authors":"Xinyi Wu;Cuiqun Chen;Hui Zeng;Zhiping Cai;Mang Ye","doi":"10.1109/TIP.2026.3660575","DOIUrl":"10.1109/TIP.2026.3660575","url":null,"abstract":"Sketch-based Person Retrieval (SBPR) aims to identify and retrieve a target individual across non-overlapping camera views using professional sketches as queries. In practice, sketches drawn by different artists often present diverse painting styles unpredictably. The substantial style variations among sketches pose significant challenges to the stability and generalizability of SBPR models. Prior works attempt to mitigate style variations through style manipulation methods, which inevitably undermine the inherent structural relations among multiple sketch features. This leads to overfitting to existing training styles and poor generalization to new, unseen sketch styles. In this paper, we introduce FreeStyle, an innovative style-inclusive framework for SBPR, built upon the foundational CLIP architecture. FreeStyle explicitly models the relations across diverse sketch styles via style consistency enhancement, enabling dynamic adaptation to both seen and unseen style variations. Specifically, Diverse Style Semantic Unification is first devised to enhance the style consistency of each identity at the semantic level by introducing objective attribute-level semantic constraints. Meanwhile, Diverse Style Feature Squeezing tackles unclear feature boundaries among identities by concentrating the intra-identity space and separating the inter-identity space, thereby strengthening style consistency at the feature representation level. Additionally, considering the feature distribution discrepancy between sketches and photos, an identity-centric cross-modal prototype alignment mechanism is introduced to facilitate identity-aware cross-modal associations and promote a compact joint embedding space. 
Extensive experiments validate that FreeStyle not only achieves stable performance under seen style variations but also demonstrates strong generalization to unseen sketch styles.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1977-1992"},"PeriodicalIF":13.7,"publicationDate":"2026-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146151509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
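The "squeezing" intuition in the FreeStyle abstract — concentrating the intra-identity feature space while separating inter-identity spaces — can be sketched as a simple compactness-plus-margin objective. This numpy toy is an assumption-laden illustration (function name, margin hinge, and centroid formulation are ours, not the paper's actual loss):

```python
import numpy as np

def squeeze_loss(features, ids, margin=1.0):
    """Toy intra/inter-identity 'squeezing' objective: pull each feature
    toward its identity centroid (compactness) and push distinct identity
    centroids at least `margin` apart (separation). Illustrative sketch only."""
    features = np.asarray(features, dtype=float)
    ids = np.asarray(ids)
    uniq = np.unique(ids)
    centroids = np.stack([features[ids == i].mean(axis=0) for i in uniq])
    # intra-identity compactness: mean squared distance to own centroid
    intra = float(np.mean([
        np.sum((features[ids == i] - c) ** 2, axis=1).mean()
        for i, c in zip(uniq, centroids)
    ]))
    # inter-identity separation: hinge on pairwise centroid distances
    inter, pairs = 0.0, 0
    for a in range(len(uniq)):
        for b in range(a + 1, len(uniq)):
            d = np.linalg.norm(centroids[a] - centroids[b])
            inter += max(0.0, margin - d) ** 2
            pairs += 1
    return intra + inter / max(pairs, 1)
```

Two tight, well-separated identity clusters give a loss of zero, while overlapping identities are penalized both for spread and for centroid proximity.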
Pub Date : 2026-02-06DOI: 10.1109/TIP.2026.3659752
Boqiang Xu;Jinlin Wu;Jian Liang;Zhenan Sun;Hongbin Liu;Jiebo Luo;Zhen Lei
Recent advances in surgical robotics and computer vision have greatly improved intelligent systems’ autonomy and perception in the operating room (OR), especially in endoscopic and minimally invasive surgeries. However, for open surgery, which is still the predominant form of surgical intervention worldwide, there has been relatively limited exploration due to its inherent complexity and the lack of large-scale, diverse datasets. To close this gap, we present OpenSurgery, by far the largest video–text pretraining and evaluation dataset for open surgery understanding. OpenSurgery consists of two subsets: OpenSurgery-Pretrain and OpenSurgery-EVAL. OpenSurgery-Pretrain consists of 843 publicly available open surgery videos for pretraining, spanning 102 hours and encompassing over 20 distinct surgical types. OpenSurgery-EVAL is a benchmark dataset for evaluating model performance in open surgery understanding, comprising 280 training and 120 test videos, totaling 49 hours. Each video in OpenSurgery is meticulously annotated by expert surgeons at three hierarchical levels of video, operation, and frame to ensure both high quality and strong clinical applicability. Next, we propose the Hierarchical Surgical Knowledge Pretraining (HierSKP) framework to facilitate large-scale multimodal representation learning for open surgery understanding. HierSKP leverages a granularity-aware contrastive learning strategy and enhances procedural comprehension by constructing hard negative samples and incorporating a Dynamic Time Warping (DTW)-based loss to capture fine-grained temporal alignment of visual semantics. Extensive experiments show that HierSKP achieves state-of-the-art performance on OpenSurgery-EVAL across multiple tasks, including operation recognition, temporal action localization, and zero-shot cross-modal retrieval. This demonstrates its strong generalizability for further advances in open surgery understanding.
{"title":"Procedure-Aware Hierarchical Alignment for Open Surgery Video-Language Pretraining","authors":"Boqiang Xu;Jinlin Wu;Jian Liang;Zhenan Sun;Hongbin Liu;Jiebo Luo;Zhen Lei","doi":"10.1109/TIP.2026.3659752","DOIUrl":"10.1109/TIP.2026.3659752","url":null,"abstract":"Recent advances in surgical robotics and computer vision have greatly improved intelligent systems’ autonomy and perception in the operating room (OR), especially in endoscopic and minimally invasive surgeries. However, for open surgery, which is still the predominant form of surgical intervention worldwide, there has been relatively limited exploration due to its inherent complexity and the lack of large-scale, diverse datasets. To close this gap, we present OpenSurgery, by far the largest video–text pretraining and evaluation dataset for open surgery understanding. OpenSurgery consists of two subsets: OpenSurgery-Pretrain and OpenSurgery-EVAL. OpenSurgery-Pretrain consists of 843 publicly available open surgery videos for pretraining, spanning 102 hours and encompassing over 20 distinct surgical types. OpenSurgery-EVAL is a benchmark dataset for evaluating model performance in open surgery understanding, comprising 280 training and 120 test videos, totaling 49 hours. Each video in OpenSurgery is meticulously annotated by expert surgeons at three hierarchical levels of video, operation, and frame to ensure both high quality and strong clinical applicability. Next, we propose the Hierarchical Surgical Knowledge Pretraining (HierSKP) framework to facilitate large-scale multimodal representation learning for open surgery understanding. HierSKP leverages a granularity-aware contrastive learning strategy and enhances procedural comprehension by constructing hard negative samples and incorporating a Dynamic Time Warping (DTW)-based loss to capture fine-grained temporal alignment of visual semantics. 
Extensive experiments show that HierSKP achieves state-of-the-art performance on OpenSurgery-EVAL across multiple tasks, including operation recognition, temporal action localization, and zero-shot cross-modal retrieval. This demonstrates its strong generalizability for further advances in open surgery understanding.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1966-1976"},"PeriodicalIF":13.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
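The DTW-based alignment loss mentioned in the HierSKP abstract builds on the classic dynamic-time-warping recurrence over a frame-to-step cost matrix. The sketch below shows only that standard recurrence; HierSKP's actual loss (likely a differentiable soft variant) is not reproduced here, and the function name is ours:

```python
import numpy as np

def dtw_cost(cost):
    """Classic DTW accumulation: minimum total cost of a monotone alignment
    path through a (frames x steps) cost matrix. A DTW-based training loss
    would replace the hard `min` with a smooth approximation; this sketch
    only illustrates the temporal-alignment idea."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # extend the cheapest of the three admissible predecessor paths
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m]
```

For well-matched sequences the diagonal path carries near-zero cost, so minimizing this quantity encourages visual frames and textual procedure steps to align in order.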