
Latest publications from IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society

A Greedy Strategy for Graph Cut
IF 13.7 Pub Date : 2026-02-11 DOI: 10.1109/TIP.2026.3661874
Shenfei Pei;Huijuan Dong;Nianci Guan;Zhongqi Lin;Feiping Nie;Xudong Jiang;Zengwei Zheng
We propose a novel Greedy Graph Cut (GGC) algorithm to address the graph partitioning problem. The algorithm begins by treating each data point as an individual cluster and iteratively merges cluster pairs that maximize the reduction in the global objective function until the desired number of clusters is achieved. We provide a theoretical proof of the monotonic convergence of the objective function values throughout this process. To improve computational efficiency, the algorithm restricts merging operations to adjacent clusters, resulting in a computational complexity that scales nearly linearly with the sample size. A significant advantage of our greedy approach is its deterministic nature, which ensures consistent results across multiple runs. This stands in contrast to many existing algorithms that are sensitive to random initialization effects. We demonstrate the effectiveness of the proposed algorithm by applying it to the Normalized Cut (N-Cut) problem, a well-studied variant of graph partitioning. Extensive experimental results show that GGC consistently outperforms the conventional two-stage optimization approach—which involves eigendecomposition followed by k-means clustering—in solving the N-Cut problem. Furthermore, comparative analyses reveal that GGC achieves superior performance compared to several state-of-the-art clustering algorithms.
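To make the greedy merging procedure concrete, the following minimal sketch (not the authors' implementation) starts from singleton clusters on a toy similarity graph and repeatedly merges the adjacent pair that most reduces a normalized-cut objective until the target number of clusters is reached. The function names `ncut_value` and `greedy_graph_cut` are illustrative, and the brute-force pair search ignores the incremental updates that give GGC its near-linear complexity.

```python
import numpy as np

def ncut_value(W, labels):
    """Normalized-cut objective: sum over clusters of cut(A, V\\A) / assoc(A, V)."""
    total = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        assoc = W[in_c, :].sum()            # total connection of cluster c to all nodes
        cut = W[np.ix_(in_c, ~in_c)].sum()  # connection of cluster c to the rest
        total += cut / max(assoc, 1e-12)
    return total

def greedy_graph_cut(W, k):
    """Toy greedy merging: start from singleton clusters and repeatedly merge the
    adjacent pair (clusters joined by at least one edge) whose merge most reduces
    the normalized-cut objective, until k clusters remain."""
    labels = np.arange(W.shape[0])
    while len(np.unique(labels)) > k:
        clusters = np.unique(labels)
        cur = ncut_value(W, labels)
        best, best_pair = None, None
        for i, a in enumerate(clusters):
            for b in clusters[i + 1:]:
                # restrict merges to adjacent clusters, as described in the abstract
                if W[np.ix_(labels == a, labels == b)].sum() == 0:
                    continue
                trial = labels.copy()
                trial[trial == b] = a
                gain = cur - ncut_value(W, trial)
                if best is None or gain > best:
                    best, best_pair = gain, (a, b)
        if best_pair is None:          # no adjacent pair left to merge
            break
        a, b = best_pair
        labels[labels == b] = a
    return labels

# two well-separated point blobs as a toy clustering problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
W = np.exp(-D ** 2)
np.fill_diagonal(W, 0.0)
print(greedy_graph_cut(W, 2))
```

Because the merge order depends only on the similarity matrix, repeated runs produce the same partition, which mirrors the deterministic behavior claimed in the abstract.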
Citations: 0
Distortion-Aware Depth Self-Updating for Self-Supervised Fisheye Monocular Depth Estimation
IF 13.7 Pub Date : 2026-02-11 DOI: 10.1109/TIP.2026.3661813
Yihang Xu;Qiulei Dong
Self-supervised monocular depth estimation for fisheye cameras has attracted much attention in recent years due to their large view range. However, the performance of existing methods in this field is generally limited by the inevitable severe distortions in fisheye images. To address this problem, we propose a distortion-aware depth self-updating network for self-supervised fisheye monocular depth estimation, called DDS-Net. The proposed DDS-Net method employs a coarse-to-fine learning strategy, in which a fine depth predictor for predicting the final depth is optimized using the scene depths predicted by a pretrained coarse depth predictor. The fine depth predictor contains a distortion-aware fisheye cost volume construction module and a depth self-updating module. The distortion-aware fisheye cost volume construction module is designed to construct a fisheye cost volume by learning the corresponding feature matching cost between consecutive fisheye frames, which enables more accurate pixel-level depth cues to be captured under severe distortions. Based on the constructed cost volume and the initial depth estimated by the pretrained coarse depth predictor, the depth self-updating module is designed to self-update the depth map in an iterative manner. Extensive experimental results on 3 fisheye datasets demonstrate that the proposed method significantly outperforms 14 state-of-the-art methods for fisheye monocular depth estimation.
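As a rough illustration of the iterative depth self-updating idea (a hedged sketch, not the DDS-Net architecture), the toy PyTorch module below repeatedly predicts a residual correction to the current depth map from cost-volume features and the current estimate. The module name `ToyDepthSelfUpdate`, its channel counts, and the number of iterations are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ToyDepthSelfUpdate(nn.Module):
    """Minimal sketch of iterative depth self-updating: a shared conv block predicts a
    residual correction of the current depth map from the concatenation of cost-volume
    features and the current depth estimate."""
    def __init__(self, cost_channels=16, iters=3):
        super().__init__()
        self.iters = iters
        self.update = nn.Sequential(
            nn.Conv2d(cost_channels + 1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, cost_feat, init_depth):
        depth = init_depth
        for _ in range(self.iters):                      # iterative self-updating
            delta = self.update(torch.cat([cost_feat, depth], dim=1))
            depth = torch.relu(depth + delta)            # keep depth non-negative
        return depth

cost_feat = torch.randn(1, 16, 64, 64)   # stand-in for the fisheye cost-volume features
init_depth = torch.rand(1, 1, 64, 64)    # stand-in for the coarse predictor's output
print(ToyDepthSelfUpdate()(cost_feat, init_depth).shape)  # torch.Size([1, 1, 64, 64])
```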
Citations: 0
Deep LoRA-Unfolding Networks for Image Restoration
IF 13.7 Pub Date : 2026-02-10 DOI: 10.1109/TIP.2026.3661406
Xiangming Wang;Haijin Zeng;Benteng Sun;Jiezhang Cao;Kai Zhang;Qiangqiang Shen;Yongyong Chen
Deep unfolding networks (DUNs), which combine conventional iterative optimization algorithms and deep neural networks into a multi-stage framework, have achieved remarkable accomplishments in Image Restoration (IR), such as spectral imaging reconstruction, compressive sensing and super-resolution. Such a network unfolds the iterative optimization steps into a stack of sequentially linked blocks. Each block consists of a Gradient Descent Module (GDM) and a Proximal Mapping Module (PMM), which is equivalent to a denoiser from a Bayesian perspective, operating on Gaussian noise with a known level. However, existing DUNs suffer from two critical limitations: 1) their PMMs share identical architectures and denoising objectives across stages, ignoring the need for stage-specific adaptation to varying noise levels; and 2) their chain of structurally repetitive blocks results in severe parameter redundancy and high memory consumption, hindering deployment in large-scale or resource-constrained scenarios. To address these challenges, we introduce generalized Deep Low-rank Adaptation (LoRA) Unfolding Networks for image restoration, named LoRun, harmonizing denoising objectives and adapting denoising levels between stages with compressed memory usage for a more efficient DUN. LoRun introduces a novel paradigm in which a single pretrained base denoiser is shared across all stages, while lightweight, stage-specific LoRA adapters are injected into the PMMs to dynamically modulate denoising behavior according to the noise level at each unfolding step. This design decouples the core restoration capability from task-specific adaptation, enabling precise control over denoising intensity without duplicating full network parameters and achieving up to $N$ times parameter reduction for an $N$-stage DUN with on-par or better performance. Extensive experiments conducted on three IR tasks validate the efficiency of our method.
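The following PyTorch sketch illustrates the core idea of sharing one base denoiser across unfolding stages while injecting stage-specific low-rank adapters. It assumes an identity degradation operator and invented module names (`LoRAConv`, `ToyLoRAUnfolding`), so it should be read as a schematic of the mechanism rather than the LoRun network itself.

```python
import torch
import torch.nn as nn

class LoRAConv(nn.Module):
    """Conv layer with a frozen shared base weight plus per-stage low-rank adapters:
    output = base(x) + up_s(down_s(x)), where s indexes the unfolding stage."""
    def __init__(self, ch, n_stages, rank=4):
        super().__init__()
        self.base = nn.Conv2d(ch, ch, 3, padding=1)
        self.base.requires_grad_(False)   # the base denoiser is shared and frozen
        self.down = nn.ModuleList([nn.Conv2d(ch, rank, 1, bias=False) for _ in range(n_stages)])
        self.up = nn.ModuleList([nn.Conv2d(rank, ch, 1, bias=False) for _ in range(n_stages)])
        for u in self.up:
            nn.init.zeros_(u.weight)      # adapters start as a zero delta

    def forward(self, x, stage):
        return self.base(x) + self.up[stage](self.down[stage](x))

class ToyLoRAUnfolding(nn.Module):
    """N-stage unfolding: a gradient-descent step on the data term followed by a shared
    denoiser whose behavior is modulated by stage-specific LoRA adapters."""
    def __init__(self, ch=3, n_stages=5):
        super().__init__()
        self.n_stages = n_stages
        self.denoiser = LoRAConv(ch, n_stages)
        self.step = nn.Parameter(torch.full((n_stages,), 0.5))

    def forward(self, y):
        x = y.clone()
        for s in range(self.n_stages):
            x = x - self.step[s] * (x - y)   # toy GDM for an identity degradation operator
            x = self.denoiser(x, s)          # PMM: shared denoiser + stage-s LoRA
        return x

print(ToyLoRAUnfolding()(torch.randn(1, 3, 32, 32)).shape)
```

Only the small `down`/`up` matrices differ between stages, which is what makes the per-stage parameter count shrink relative to duplicating the full denoiser.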
Citations: 0
You Only Train Once: A Unified Framework for Both Full-Reference and No-Reference Image Quality Assessment
IF 13.7 Pub Date : 2026-02-10 DOI: 10.1109/TIP.2026.3661408
Yi Ke Yun, Weisi Lin

Existing Image Quality Assessment (IQA) models are limited to either full reference or no reference evaluation tasks, while humans can seamlessly switch between these assessment types. This motivates us to explore resolving these two tasks using a versatile model. In this work, we propose a novel framework that unifies full reference and no reference IQA. Our approach utilizes an encoder to extract multi-level features from images and introduces a Hierarchical Attention module to adaptively handle spatial distortions for both full reference and no reference inputs. Additionally, we develop a Semantic Distortion Aware module to analyze feature correlations between shallow and deep layers of the encoder, thereby accounting for the varying effects of different distortions on these layers. Our proposed framework achieves state-of-the-art performance for both full-reference and no-reference IQA tasks when trained separately. Furthermore, when the model is trained jointly on both types of tasks, it not only enhances performance in no-reference IQA but also maintains competitive results in full-reference IQA. This integrated approach facilitates a single training process that efficiently addresses both IQA tasks, representing a significant advancement in model versatility and performance.
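A minimal sketch of how one encoder and one regression head can serve both full-reference and no-reference inputs is given below. The layer sizes and the difference-based fusion are assumptions for illustration only, not the paper's Hierarchical Attention or Semantic Distortion Aware modules.

```python
import torch
import torch.nn as nn

class ToyUnifiedIQA(nn.Module):
    """Hedged sketch of a single model handling both FR and NR inputs: a shared encoder;
    when a reference is given, its features are fused with the distorted-image features,
    otherwise the distorted features go straight to the same scoring head."""
    def __init__(self, feat=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fuse = nn.Linear(2 * feat, feat)
        self.head = nn.Linear(feat, 1)

    def forward(self, distorted, reference=None):
        f_d = self.encoder(distorted)
        if reference is not None:                       # full-reference path
            f_r = self.encoder(reference)
            f_d = torch.relu(self.fuse(torch.cat([f_d, f_d - f_r], dim=1)))
        return self.head(f_d).squeeze(1)                # predicted quality score

x = torch.randn(2, 3, 64, 64)
model = ToyUnifiedIQA()
print(model(x).shape, model(x, reference=torch.randn(2, 3, 64, 64)).shape)
```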

Citations: 0
PestScope: Exclusion-Aware Large Multimodal Model for Fine-Grained Agricultural Pest Segmentation
IF 13.7 Pub Date : 2026-02-10 DOI: 10.1109/TIP.2026.3661417
Yang Yang;Huibin Luo;Haotian Wang;Jingchi Jiang;Jie Liu;Jian Wei;Ming Fang
Reasoning segmentation (RS) interprets implicit textual instructions to accurately segment target regions. This reasoning capability transforms ambiguous non-expert queries into precise pixel-level masks, thereby enabling downstream tasks like area measurement and density analysis with a level of precision unattainable by detection methods. However, existing RS models are not tailored for agriculture and lack domain-specific knowledge, which poses challenges in handling similar pest appearances and small target scales. To bridge this gap, we introduce a fine-grained pest RS task with two subtasks: Pest Discriminative Referring Expression Segmentation (PDRES) and Pest Exclusion Reasoning Segmentation (PERS). Based on this, we propose PestScope, which integrates vision, language, and reasoning for fine-grained pest segmentation. To tackle the exclusion of small non-target pests, we introduce a dedicated [NON] token alongside the standard [SEG] token for target pests. This guides the model to prioritize small target pests and suppress non-target background regions. To further address pest similarity, we propose an Exclusivity Suppression Loss, applying differentiated supervision to [SEG] and [NON] tokens to better separate target and non-target pests. Additionally, we develop an automated dataset construction pipeline to address the scarcity of fine-grained, difficulty-controllable pest RS datasets. It produces 45k and 27.6k image-text-mask samples for the PDRES and PERS tasks, respectively, covering 18 pest categories. Experiments show that in small and similar pest scenarios, integrating PestScope into mainstream models improves average gIoU by 4.28% on PDRES and 6.49% on PERS. For unseen pest categories, gIoU increases by 21.72% and 8.66%, respectively, demonstrating strong generalization. Code and datasets will be available at: https://github.com/aluodaydayup/PestScope
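One plausible reading of the differentiated supervision on the [SEG] and [NON] tokens is sketched below: the [SEG] mask is supervised on target pests, the [NON] mask on non-target pests, and an extra penalty discourages overlap between the two predictions. The function name and the exact form of the overlap term are assumptions; the published Exclusivity Suppression Loss may differ.

```python
import torch
import torch.nn.functional as F

def exclusivity_suppression_loss(seg_logits, non_logits, target_mask, nontarget_mask,
                                 overlap_w=1.0):
    """Hedged sketch of differentiated [SEG]/[NON] supervision: supervise the [SEG] mask on
    target pests and the [NON] mask on non-target pests, then penalize overlap between the
    two predictions so the masks stay mutually exclusive."""
    seg_loss = F.binary_cross_entropy_with_logits(seg_logits, target_mask)
    non_loss = F.binary_cross_entropy_with_logits(non_logits, nontarget_mask)
    overlap = (torch.sigmoid(seg_logits) * torch.sigmoid(non_logits)).mean()
    return seg_loss + non_loss + overlap_w * overlap

seg_logits = torch.randn(2, 1, 64, 64, requires_grad=True)
non_logits = torch.randn(2, 1, 64, 64, requires_grad=True)
target = (torch.rand(2, 1, 64, 64) > 0.9).float()      # sparse small-target masks
nontarget = (torch.rand(2, 1, 64, 64) > 0.9).float()
print(exclusivity_suppression_loss(seg_logits, non_logits, target, nontarget))
```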
Citations: 0
Monocular Multi-Object 3D Visual Language Tracking
IF 13.7 Pub Date : 2026-02-10 DOI: 10.1109/TIP.2026.3661407
Hongkai Wei;Rong Wang;Haixiang Hu;Shijie Sun;Xiangyu Song;Mingtao Feng;Keyu Guo;Yongle Huang;Hua Cui;Naveed Akhtar
Visual Language Tracking (VLT) enables machines to perform tracking in the real world through human-like language descriptions. However, existing VLT methods are limited to 2D spatial tracking or single-object 3D tracking and do not support multi-object 3D tracking within monocular video. This limitation arises because advancements in 3D multi-object tracking have predominantly relied on sensor-based data (e.g., point clouds, depth sensors) that lacks corresponding language descriptions. Moreover, natural language descriptions in existing VLT literature often suffer from redundancy, impeding the efficient and precise localization of multiple objects. We present the first technique to extend VLT to multi-object 3D tracking using monocular video. We introduce a comprehensive framework that includes (i) a Monocular Multi-object 3D Visual Language Tracking (MoMo-3DVLT) task, (ii) a large-scale dataset, MoMo-3DRoVLT, tailored for this task, and (iii) a custom neural model. Our dataset, generated with the aid of Large Language Models (LLMs) and manual verification, contains 8,216 video sequences annotated with both 2D and 3D bounding boxes, with each sequence accompanied by three freely generated, human-level textual descriptions. We propose MoMo-3DVLTracker, the first neural model specifically designed for MoMo-3DVLT. This model integrates a multimodal feature extractor, a visual language encoder-decoder, and modules for detection and tracking, setting a strong baseline for MoMo-3DVLT. Beyond existing paradigms, it introduces a task-specific structural coupling that integrates a differentiable linked-memory mechanism with depth-guided and language-conditioned reasoning for robust monocular 3D multi-object tracking. Experimental results demonstrate that our approach outperforms existing methods on the MoMo-3DRoVLT dataset. Our dataset and code are available at https://github.com/hongkai-wei/MoMo-3DVLT.
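For readers unfamiliar with the tracking stage, the snippet below shows a generic 3D data-association step (centroid-distance cost plus Hungarian matching via SciPy). It is included only to illustrate what the tracking step of a multi-object 3D tracker typically involves; it is not MoMo-3DVLTracker's differentiable linked-memory mechanism, and the distance threshold is an arbitrary example value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_3d(tracks, detections, max_dist=2.0):
    """Generic association step used in many 3D multi-object trackers: build a cost matrix
    from 3D centroid distances and solve it with the Hungarian algorithm."""
    if len(tracks) == 0 or len(detections) == 0:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets

tracks = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 1.0]])        # previous-frame 3D box centers
detections = np.array([[0.2, 0.1, 0.0], [9.0, 9.0, 9.0]])    # current-frame 3D box centers
print(associate_3d(tracks, detections))
```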
Citations: 0
Learning Retinex Prior for Compressive Hyperspectral Image Reconstruction
IF 13.7 Pub Date : 2026-02-09 DOI: 10.1109/TIP.2026.3659746
Mengzu Liu;Junwei Xu;Weisheng Dong;Le Dong;Guangming Shi
Image reconstruction in coded aperture snapshot spectral compressive imaging (CASSI) aims to recover high-fidelity hyperspectral images (HSIs) from compressed 2D measurements. While deep unfolding networks have shown promising performance, the degradation induced by the CASSI degradation model often introduces global illumination discrepancies in the reconstructions, creating artifacts similar to those in low-light images. To address these challenges, we propose a novel Retinex Prior-Driven Unfolding Network (RPDUN), which unfolds the optimization incorporating the Retinex prior as a regularization term into a multi-stage network. This design provides global illumination adjustment for compressed measurements, effectively compensating for spatial-spectral degradation according to physical modulation and capturing intrinsic spectral characteristics. To the best of our knowledge, this is the first application of the Retinex prior in hyperspectral image reconstruction. Furthermore, to mitigate the noise in the reflectance domain, which can be amplified during decomposition, we introduce an Adaptive Token Selection Transformer (ATST). This module adaptively filters out weakly correlated tokens before the self-attention computation, effectively reducing noise and artifacts within the recovered reflectance map. Extensive experiments on both simulated and real-world datasets demonstrate that RPDUN achieves new state-of-the-art performance, significantly improving reconstruction quality while maintaining computational efficiency. The code is available at https://github.com/ZUGE0312/RPDUN
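To recall what the Retinex prior refers to, the snippet below performs the classical decomposition I = R * L with a Gaussian-smoothed illumination estimate (using SciPy). RPDUN learns this prior inside the unfolding network rather than applying a fixed filter, so the sigma value and the decomposition here are purely illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(img, sigma=15.0, eps=1e-6):
    """Classical single-scale Retinex split I = R * L: illumination L is a smooth,
    low-frequency estimate; reflectance R = I / L carries the detail."""
    illumination = gaussian_filter(img, sigma=sigma)
    reflectance = img / (illumination + eps)
    return reflectance, illumination

img = np.random.rand(64, 64).astype(np.float32)   # stand-in for one spectral band
R, L = retinex_decompose(img)
print(R.shape, L.shape, float(np.abs(R * L - img).max()))  # recomposition error is eps-level
```

The abstract's point about noise amplification follows directly from this split: dividing by a small illumination value magnifies noise in the reflectance, which is what the Adaptive Token Selection Transformer is meant to suppress.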
Citations: 0
Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion
IF 13.7 Pub Date : 2026-02-09 DOI: 10.1109/TIP.2026.3660576
Zhiwen Yang;Yuxin Peng
Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for perception-prediction-planning autonomous driving systems. Although existing methods have made significant progress, their optimization relies solely on supervision from voxel labels and faces the challenge of voxel sparsity, as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a Multi-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits scene- and instance-level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level by fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the consistency of critical feature distributions across different resolutions. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that our MRA approach significantly outperforms existing state-of-the-art methods, showcasing its effectiveness in mitigating the impact of sparse voxel labels. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP.
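The snippet below gives one hedged interpretation of a cubic semantic anisotropy measure: for each voxel, the deviation of its class distribution from the average distribution in a 3x3x3 neighborhood. The kernel size, the L1 deviation, and the function name are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def cubic_semantic_anisotropy(logits, kernel=3):
    """For every voxel, measure how far its semantic distribution departs from the average
    distribution inside its kernel^3 neighborhood (higher value = more anisotropic)."""
    probs = logits.softmax(dim=1)                                    # (B, C, D, H, W)
    neigh = F.avg_pool3d(probs, kernel, stride=1, padding=kernel // 2)
    return (probs - neigh).abs().sum(dim=1)                          # (B, D, H, W)

logits = torch.randn(1, 20, 16, 16, 16)     # 20 semantic classes on a toy 16^3 voxel grid
aniso = cubic_semantic_anisotropy(logits)
print(aniso.shape, float(aniso.mean()))
```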
Citations: 0
FreeStyle: Toward Style-Inclusive Sketch-Based Person Retrieval
IF 13.7 Pub Date : 2026-02-09 DOI: 10.1109/TIP.2026.3660575
Xinyi Wu;Cuiqun Chen;Hui Zeng;Zhiping Cai;Mang Ye
Sketch-based Person Retrieval (SBPR) aims to identify and retrieve a target individual across non-overlapping camera views using professional sketches as queries. In practice, sketches drawn by different artists often present diverse painting styles unpredictably. The substantial style variations among sketches pose significant challenges to the stability and generalizability of SBPR models. Prior works attempt to mitigate style variations through style manipulation methods, which inevitably undermine the inherent structural relations among multiple sketch features. This leads to overfitting on existing training styles and struggles with generalizing to new, unseen sketch styles. In this paper, we introduce FreeStyle, an innovative style-inclusive framework for SBPR, built upon the foundational CLIP architecture. FreeStyle explicitly models the relations across diverse sketch styles via style consistency enhancement, enabling dynamic adaptation to both seen and unseen style variations. Specifically, Diverse Style Semantic Unification is first devised to enhance the style consistency of each identity at the semantic level by introducing objective attribute-level semantic constraints. Meanwhile, Diverse Style Feature Squeezing tackles unclear feature boundaries among identities by concentrating the intra-identity space and separating the inter-identity space, thereby strengthening style consistency at the feature representation level. Additionally, considering the feature distribution discrepancy between sketches and photos, an identity-centric cross-modal prototype alignment mechanism is introduced to facilitate identity-aware cross-modal associations and promote a compact joint embedding space. Extensive experiments validate that FreeStyle not only achieves stable performance under seen style variations but also demonstrates strong generalization to unseen sketch styles.
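A generic, centroid-based version of "concentrate the intra-identity space and separate the inter-identity space" is sketched below for intuition. The margin value, the specific distance terms, and the function name are assumptions, not FreeStyle's actual Diverse Style Feature Squeezing loss.

```python
import torch
import torch.nn.functional as F

def squeeze_separate_loss(features, identity_ids, margin=0.5):
    """Pull each feature toward its identity centroid (intra-identity compactness) and push
    different identity centroids at least `margin` apart (inter-identity separation)."""
    features = F.normalize(features, dim=1)
    ids = identity_ids.unique()
    centroids = torch.stack([features[identity_ids == i].mean(dim=0) for i in ids])
    intra = torch.stack([
        (features[identity_ids == i] - centroids[k]).pow(2).sum(dim=1).mean()
        for k, i in enumerate(ids)
    ]).mean()
    dist = torch.cdist(centroids, centroids)
    off_diag = dist[~torch.eye(len(ids), dtype=torch.bool)]
    inter = F.relu(margin - off_diag).mean()
    return intra + inter

feats = torch.randn(8, 128, requires_grad=True)   # sketch features drawn in different styles
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])      # identity label per sketch
print(squeeze_separate_loss(feats, ids))
```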
Citations: 0
Procedure-Aware Hierarchical Alignment for Open Surgery Video-Language Pretraining
IF 13.7 Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3659752
Boqiang Xu;Jinlin Wu;Jian Liang;Zhenan Sun;Hongbin Liu;Jiebo Luo;Zhen Lei
Recent advances in surgical robotics and computer vision have greatly improved intelligent systems’ autonomy and perception in the operating room (OR), especially in endoscopic and minimally invasive surgeries. However, for open surgery, which is still the predominant form of surgical intervention worldwide, there has been relatively limited exploration due to its inherent complexity and the lack of large-scale, diverse datasets. To close this gap, we present OpenSurgery, by far the largest video–text pretraining and evaluation dataset for open surgery understanding. OpenSurgery consists of two subsets: OpenSurgery-Pretrain and OpenSurgery-EVAL. OpenSurgery-Pretrain consists of 843 publicly available open surgery videos for pretraining, spanning 102 hours and encompassing over 20 distinct surgical types. OpenSurgery-EVAL is a benchmark dataset for evaluating model performance in open surgery understanding, comprising 280 training and 120 test videos, totaling 49 hours. Each video in OpenSurgery is meticulously annotated by expert surgeons at three hierarchical levels of video, operation, and frame to ensure both high quality and strong clinical applicability. Next, we propose the Hierarchical Surgical Knowledge Pretraining (HierSKP) framework to facilitate large-scale multimodal representation learning for open surgery understanding. HierSKP leverages a granularity-aware contrastive learning strategy and enhances procedural comprehension by constructing hard negative samples and incorporating a Dynamic Time Warping (DTW)-based loss to capture fine-grained temporal alignment of visual semantics. Extensive experiments show that HierSKP achieves state-of-the-art performance on OpenSurgery-EVAL across multiple tasks, including operation recognition, temporal action localization, and zero-shot cross-modal retrieval. This demonstrates its strong generalizability for further advances in open surgery understanding.
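Since the abstract highlights a DTW-based loss for fine-grained temporal alignment, the snippet below shows the classical dynamic-time-warping recursion between per-frame and per-step embeddings. HierSKP uses a differentiable variant as a training loss; this plain dynamic-programming version, with invented variable names and random feature stand-ins, is only meant to illustrate the alignment cost being minimized.

```python
import numpy as np

def dtw_alignment_cost(video_feats, text_feats):
    """Classical dynamic time warping between per-frame and per-step embeddings,
    using cosine distance as the local cost."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    cost = 1.0 - v @ t.T                      # (n_frames, n_steps) pairwise distances
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]                          # total cost of the best monotonic alignment

video = np.random.rand(30, 64)   # 30 frame embeddings (stand-ins)
text = np.random.rand(5, 64)     # 5 operation-step embeddings (stand-ins)
print(dtw_alignment_cost(video, text))
```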
Citations: 0