
Latest publications in IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society)

Monocular Multi-Object 3D Visual Language Tracking
IF 13.7 | Pub Date: 2026-02-10 | DOI: 10.1109/TIP.2026.3661407
Hongkai Wei;Rong Wang;Haixiang Hu;Shijie Sun;Xiangyu Song;Mingtao Feng;Keyu Guo;Yongle Huang;Hua Cui;Naveed Akhtar
Visual Language Tracking (VLT) enables machines to perform tracking in the real world through human-like language descriptions. However, existing VLT methods are limited to 2D spatial tracking or single-object 3D tracking and do not support multi-object 3D tracking within monocular video. This limitation arises because advancements in 3D multi-object tracking have predominantly relied on sensor-based data (e.g., point clouds, depth sensors) that lack corresponding language descriptions. Moreover, natural language descriptions in existing VLT literature often suffer from redundancy, impeding the efficient and precise localization of multiple objects. We present the first technique to extend VLT to multi-object 3D tracking using monocular video. We introduce a comprehensive framework that includes (i) a Monocular Multi-object 3D Visual Language Tracking (MoMo-3DVLT) task, (ii) a large-scale dataset, MoMo-3DRoVLT, tailored for this task, and (iii) a custom neural model. Our dataset, generated with the aid of Large Language Models (LLMs) and manual verification, contains 8,216 video sequences annotated with both 2D and 3D bounding boxes, with each sequence accompanied by three freely generated, human-level textual descriptions. We propose MoMo-3DVLTracker, the first neural model specifically designed for MoMo-3DVLT. This model integrates a multimodal feature extractor, a visual language encoder-decoder, and modules for detection and tracking, setting a strong baseline for MoMo-3DVLT. Beyond existing paradigms, it introduces a task-specific structural coupling that integrates a differentiable linked-memory mechanism with depth-guided and language-conditioned reasoning for robust monocular 3D multi-object tracking. Experimental results demonstrate that our approach outperforms existing methods on the MoMo-3DRoVLT dataset. Our dataset and code are available at https://github.com/hongkai-wei/MoMo-3DVLT.
Vol. 35, pp. 2050-2065 | Citations: 0
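The abstract above describes depth-guided and language-conditioned reasoning over monocular video features. As a rough, hypothetical sketch of one plausible ingredient (cross-attending visual tokens to text embeddings), the PyTorch snippet below is illustrative only; the module name, dimensions, and residual design are assumptions rather than the authors' released code.
```python
import torch
import torch.nn as nn

class LanguageConditionedFusion(nn.Module):
    """Cross-attend visual tokens to text tokens (illustrative only)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N_vis, dim); text_tokens: (B, N_txt, dim)
        fused, _ = self.cross_attn(query=visual_tokens,
                                   key=text_tokens,
                                   value=text_tokens)
        return self.norm(visual_tokens + fused)  # residual + layer norm

if __name__ == "__main__":
    fusion = LanguageConditionedFusion()
    vis = torch.randn(2, 100, 256)   # e.g. flattened per-frame features
    txt = torch.randn(2, 12, 256)    # e.g. encoded description tokens
    print(fusion(vis, txt).shape)    # torch.Size([2, 100, 256])
```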
Learning Retinex Prior for Compressive Hyperspectral Image Reconstruction
IF 13.7 | Pub Date: 2026-02-09 | DOI: 10.1109/TIP.2026.3659746
Mengzu Liu;Junwei Xu;Weisheng Dong;Le Dong;Guangming Shi
Image reconstruction in coded aperture snapshot spectral compressive imaging (CASSI) aims to recover high-fidelity hyperspectral images (HSIs) from compressed 2D measurements. While deep unfolding networks have shown promising performance, the degradation induced by the CASSI degradation model often introduces global illumination discrepancies in the reconstructions, creating artifacts similar to those in low-light images. To address these challenges, we propose a novel Retinex Prior-Driven Unfolding Network (RPDUN), which unfolds the optimization incorporating the Retinex prior as a regularization term into a multi-stage network. This design provides global illumination adjustment for compressed measurements, effectively compensating for spatial-spectral degradation according to physical modulation and capturing intrinsic spectral characteristics. To the best of our knowledge, this is the first application of the Retinex prior in hyperspectral image reconstruction. Furthermore, to mitigate the noise in the reflectance domain, which can be amplified during decomposition, we introduce an Adaptive Token Selection Transformer (ATST). This module adaptively filters out weakly correlated tokens before the self-attention computation, effectively reducing noise and artifacts within the recovered reflectance map. Extensive experiments on both simulated and real-world datasets demonstrate that RPDUN achieves new state-of-the-art performance, significantly improving reconstruction quality while maintaining computational efficiency. The code is available at https://github.com/ZUGE0312/RPDUN.
Vol. 35, pp. 1786-1801 | Citations: 0
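The ATST module described above filters out weakly correlated tokens before self-attention. The sketch below shows one generic way to do that (scoring tokens against the mean token and keeping the top-k); the scoring rule, keep ratio, and class name are assumptions, not the paper's implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKTokenAttention(nn.Module):
    """Keep only the top-k most relevant tokens before self-attention."""
    def __init__(self, dim=64, heads=4, keep_ratio=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                             # tokens: (B, N, dim)
        ref = tokens.mean(dim=1, keepdim=True)             # (B, 1, dim) reference
        scores = F.cosine_similarity(tokens, ref, dim=-1)  # (B, N) relevance score
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                # (B, k) kept positions
        sel = torch.gather(tokens, 1,
                           idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        out, _ = self.attn(sel, sel, sel)                  # attention on kept tokens
        return out, idx

if __name__ == "__main__":
    module = TopKTokenAttention()
    y, kept = module(torch.randn(2, 32, 64))
    print(y.shape, kept.shape)   # (2, 16, 64) and (2, 16)
```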
Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion
IF 13.7 | Pub Date: 2026-02-09 | DOI: 10.1109/TIP.2026.3660576
Zhiwen Yang;Yuxin Peng
Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for perception-prediction-planning autonomous driving systems. Although significant progress has been made in existing methods, their optimization relies solely on supervision from voxel labels and faces the challenge of voxel sparsity, as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a Multi-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits scene- and instance-level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that our MRA approach significantly outperforms existing state-of-the-art methods, showcasing its effectiveness in mitigating the impact of sparse voxel labels. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP.
Vol. 35, pp. 1771-1785 | Citations: 0
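To make the "cubic semantic anisotropy" idea concrete, the sketch below scores how much each voxel's class distribution differs from its 3x3x3 neighborhood average. The exact definition in the paper may differ; the function name and the L1 measure are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def cubic_anisotropy(logits):
    """logits: (B, C, D, H, W) per-voxel class scores -> (B, D, H, W) score."""
    probs = logits.softmax(dim=1)
    # Average class distribution over the surrounding 3x3x3 cube (incl. itself).
    neigh = F.avg_pool3d(probs, kernel_size=3, stride=1, padding=1)
    # L1 distance between each voxel's distribution and its neighborhood average.
    return (probs - neigh).abs().sum(dim=1)

if __name__ == "__main__":
    x = torch.randn(1, 20, 16, 16, 16)        # 20 semantic classes
    print(cubic_anisotropy(x).shape)          # torch.Size([1, 16, 16, 16])
```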
FreeStyle: Toward Style-Inclusive Sketch-Based Person Retrieval
IF 13.7 | Pub Date: 2026-02-09 | DOI: 10.1109/TIP.2026.3660575
Xinyi Wu;Cuiqun Chen;Hui Zeng;Zhiping Cai;Mang Ye
Sketch-based Person Retrieval (SBPR) aims to identify and retrieve a target individual across non-overlapping camera views using professional sketches as queries. In practice, sketches drawn by different artists often present diverse and unpredictable painting styles. The substantial style variations among sketches pose significant challenges to the stability and generalizability of SBPR models. Prior works attempt to mitigate style variations through style manipulation methods, which inevitably undermine the inherent structural relations among multiple sketch features. This leads to overfitting on existing training styles and poor generalization to new, unseen sketch styles. In this paper, we introduce FreeStyle, an innovative style-inclusive framework for SBPR, built upon the foundational CLIP architecture. FreeStyle explicitly models the relations across diverse sketch styles via style consistency enhancement, enabling dynamic adaptation to both seen and unseen style variations. Specifically, Diverse Style Semantic Unification is first devised to enhance the style consistency of each identity at the semantic level by introducing objective attribute-level semantic constraints. Meanwhile, Diverse Style Feature Squeezing tackles unclear feature boundaries among identities by concentrating the intra-identity space and separating the inter-identity space, thereby strengthening style consistency at the feature representation level. Additionally, considering the feature distribution discrepancy between sketches and photos, an identity-centric cross-modal prototype alignment mechanism is introduced to facilitate identity-aware cross-modal associations and promote a compact joint embedding space. Extensive experiments validate that FreeStyle not only achieves stable performance under seen style variations but also demonstrates strong generalization to unseen sketch styles.
Vol. 35, pp. 1977-1992 | Citations: 0
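The identity-centric cross-modal prototype alignment mentioned above can be pictured with a minimal sketch: mean photo embeddings per identity serve as prototypes, and sketch embeddings are pulled toward the matching prototype with a contrastive loss. The loss form, temperature, and function names are assumptions, not FreeStyle's actual objective.
```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(sketch_feats, photo_feats, labels, temperature=0.1):
    """sketch_feats, photo_feats: (N, D); labels: (N,) integer identity ids."""
    ids = labels.unique()
    # Identity prototypes: mean photo embedding per identity.
    protos = torch.stack([photo_feats[labels == i].mean(dim=0) for i in ids])
    protos = F.normalize(protos, dim=-1)                  # (K, D)
    sketches = F.normalize(sketch_feats, dim=-1)          # (N, D)
    logits = sketches @ protos.t() / temperature          # (N, K) similarities
    # Map each sample's identity label to its prototype index.
    targets = torch.tensor([(ids == l).nonzero().item() for l in labels])
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    sketch = torch.randn(8, 128)
    photo = torch.randn(8, 128)
    lbls = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    print(prototype_alignment_loss(sketch, photo, lbls).item())
```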
Procedure-Aware Hierarchical Alignment for Open Surgery Video-Language Pretraining
IF 13.7 | Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3659752
Boqiang Xu;Jinlin Wu;Jian Liang;Zhenan Sun;Hongbin Liu;Jiebo Luo;Zhen Lei
Recent advances in surgical robotics and computer vision have greatly improved intelligent systems’ autonomy and perception in the operating room (OR), especially in endoscopic and minimally invasive surgeries. However, for open surgery, which is still the predominant form of surgical intervention worldwide, there has been relatively limited exploration due to its inherent complexity and the lack of large-scale, diverse datasets. To close this gap, we present OpenSurgery, by far the largest video-text pretraining and evaluation dataset for open surgery understanding. OpenSurgery consists of two subsets: OpenSurgery-Pretrain and OpenSurgery-EVAL. OpenSurgery-Pretrain consists of 843 publicly available open surgery videos for pretraining, spanning 102 hours and encompassing over 20 distinct surgical types. OpenSurgery-EVAL is a benchmark dataset for evaluating model performance in open surgery understanding, comprising 280 training and 120 test videos, totaling 49 hours. Each video in OpenSurgery is meticulously annotated by expert surgeons at three hierarchical levels of video, operation, and frame to ensure both high quality and strong clinical applicability. Next, we propose the Hierarchical Surgical Knowledge Pretraining (HierSKP) framework to facilitate large-scale multimodal representation learning for open surgery understanding. HierSKP leverages a granularity-aware contrastive learning strategy and enhances procedural comprehension by constructing hard negative samples and incorporating a Dynamic Time Warping (DTW)-based loss to capture fine-grained temporal alignment of visual semantics. Extensive experiments show that HierSKP achieves state-of-the-art performance on OpenSurgery-EVAL across multiple tasks, including operation recognition, temporal action localization, and zero-shot cross-modal retrieval. This demonstrates its strong generalizability for further advances in open surgery understanding.
Vol. 35, pp. 1966-1976 | Citations: 0
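The abstract mentions a Dynamic Time Warping (DTW)-based loss for fine-grained temporal alignment. Below is a plain, non-differentiable DTW cost between clip and text-step embeddings as a reference point; HierSKP's actual loss is presumably a smoothed, differentiable variant, and all names here are illustrative.
```python
import torch

def dtw_cost(video_emb, text_emb):
    """video_emb: (T, D), text_emb: (S, D) -> scalar alignment cost."""
    # Pairwise distances between every video clip and every text step.
    dist = torch.cdist(video_emb.unsqueeze(0), text_emb.unsqueeze(0)).squeeze(0)
    T, S = dist.shape
    acc = torch.full((T + 1, S + 1), float("inf"))
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            # Standard DTW recursion; a soft-min here would make it differentiable.
            acc[i, j] = dist[i - 1, j - 1] + torch.min(
                torch.stack([acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]))
    return acc[T, S]

if __name__ == "__main__":
    clips = torch.randn(10, 64)   # 10 video clip embeddings
    steps = torch.randn(4, 64)    # 4 textual step embeddings
    print(dtw_cost(clips, steps).item())
```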
Foundation Model Empowered Real-Time Video Conference With Semantic Communications
IF 13.7 | Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3659719
Mingkai Chen;Wenbo Ma;Mujian Zeng;Xiaoming He;Jian Xiong;Lei Wang;Anwer Al-Dulaimi;Shahid Mumtaz
With the development of real-time video conferencing, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity is becoming one of the main features of future multimedia services, which brings a new challenge to Computer Vision (CV) for communications. In addition, many CV directions for video, such as recognition, understanding, saliency segmentation, and coding, cannot satisfy the demands of multiple interactive tasks without being integrated. Meanwhile, with the rapid development of foundation models, we apply task-oriented semantic communications to handle them. Therefore, we propose a novel framework, called Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the requirement of interactivity in multimedia services. Firstly, at the transmitter, we perform causal understanding and spatiotemporal decoupling of interactive videos, with the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA), and Segment Anything Model 2 (SAM2), to accomplish video semantic segmentation. Secondly, in transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which also accommodates the asymmetric weighting of semantic information in real-time video, so that we achieve a low bit rate and high semantic fidelity in video transmission. Thirdly, at the receiver, RTVCFM performs multidimensional fusion over the full semantic segmentation using the Diffusion Model for Foreground Background Fusion (DMFBF), and then reconstructs the video streams. Finally, simulation results demonstrate that RTVCFM achieves a compression ratio as high as 95.6% while maintaining high semantic similarity of 98.73% in Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in Structural Similarity (SSIM), showing that the reconstructed video closely resembles the original.
Vol. 35, pp. 1740-1755 | Citations: 0
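To illustrate the general idea of CSI-driven, importance-weighted transmission (not RTVCFM's actual optimization), the toy sketch below splits a bit budget between foreground and background streams as channel SNR varies; the weighting formula, SNR range, and parameter names are invented for illustration.
```python
def allocate_bits(total_bits, snr_db, fg_importance=0.8, min_share=0.1):
    """Return (foreground_bits, background_bits) for one transmission round."""
    # Normalize SNR to [0, 1] over an assumed 0-30 dB operating range.
    snr_factor = max(0.0, min(1.0, snr_db / 30.0))
    # Poorer channels push a larger share of the budget to the foreground.
    fg_share = fg_importance + (1.0 - fg_importance) * (1.0 - snr_factor)
    fg_share = min(1.0 - min_share, max(min_share, fg_share))
    fg_bits = int(total_bits * fg_share)
    return fg_bits, total_bits - fg_bits

if __name__ == "__main__":
    for snr in (5, 15, 30):
        print(snr, "dB ->", allocate_bits(100_000, snr))
```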
Anatomy-Aware MR-Imaging-Only Radiotherapy
IF 13.7 | Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3658010
Hao Yang;Yue Sun;Hui Xie;Lina Zhao;Chi Kin Lam;Qiang Zhao;Xiangyu Xiong;Kunyan Cai;Behdad Dashtbozorg;Chenggang Yan;Tao Tan
The synthesis of computed tomography images can supplement electron density information and eliminate MR-CT image registration errors. Consequently, an increasing number of MR-to-CT image translation approaches are being proposed for MR-only radiotherapy planning. However, due to substantial anatomical differences between various regions, traditional approaches often require an independent model to be developed and used for each region. In this paper, we propose a unified, prompt-driven model that dynamically adapts to different anatomical regions and generates CT images with high structural consistency. Specifically, it utilizes a region-specific attention mechanism, including a region-aware vector and a dynamic gating factor, to achieve MRI-to-CT image translation for multiple anatomical regions. Qualitative and quantitative results on three datasets of anatomical regions demonstrate that our model generates clearer and more anatomically detailed CT images than other state-of-the-art translation models. The results of the dosimetric analysis also indicate that our proposed model generates images with dose distributions more closely aligned to those of the real CT images. Thus, the proposed model demonstrates promising potential for enabling MR-only radiotherapy across multiple anatomical regions. We have released the source code for our RSAM model. The repository is publicly accessible at: https://github.com/yhyumi123/RSAM
Vol. 35, pp. 1680-1695 | Citations: 0
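The region-specific attention mechanism above pairs a region-aware vector with a dynamic gating factor. A minimal sketch of that pattern, assuming a learned per-region embedding that produces a channel-wise sigmoid gate, is shown below; it is not the released RSAM code, and all names and sizes are assumptions.
```python
import torch
import torch.nn as nn

class RegionGate(nn.Module):
    """Region-conditioned channel gating (illustrative only)."""
    def __init__(self, num_regions=3, channels=64):
        super().__init__()
        self.region_embed = nn.Embedding(num_regions, channels)  # region-aware vector
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, feats, region_id):
        # feats: (B, C, H, W); region_id: (B,) integer region index
        g = self.gate(self.region_embed(region_id))       # (B, C), values in (0, 1)
        return feats * g.unsqueeze(-1).unsqueeze(-1)      # channel-wise modulation

if __name__ == "__main__":
    gate = RegionGate()
    x = torch.randn(2, 64, 32, 32)
    print(gate(x, torch.tensor([0, 2])).shape)            # torch.Size([2, 64, 32, 32])
```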
Double Nonconvex Tensor Robust Kernel Principal Component Analysis and Its Visual Applications
IF 13.7 | Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3659302
Liang Wu;Jianjun Wang;Wei-Shi Zheng;Guangming Shi
Tensor robust principal component analysis (TRPCA), as a popular linear low-rank method, has been widely applied to various visual tasks. The mathematical process of the low-rank prior is derived from the linear latent variable model. However, for nonlinear tensor data with rich information, their nonlinear structures may break through the assumption of low-rankness and lead to a large approximation error for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, the general paradigm of the nonlinear tensor plus sparse tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA), is first established in this paper. To efficiently tackle the TRKPCA problem, two novel nonconvex regularizers, the kernelized tensor Schatten-$p$ norm (KTSPN) and a generalized nonconvex regularization, are designed, where the former, with tighter theoretical support, adequately captures nonlinear features (i.e., implicit low-rankness), and the latter ensures sparser structural coding, guaranteeing more robust separation results. Then, by integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method to achieve our expectation. Finally, we develop an efficient optimization framework via the alternating direction method of multipliers (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show the higher competitiveness of our method compared with other state-of-the-art regularization methods. The code has been released on our ResearchGate homepage: https://www.researchgate.net/publication/397181729_DNTRKPCA_code
Vol. 35, pp. 1711-1726 | Citations: 0
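For readers unfamiliar with the Schatten-p regularizer at the heart of KTSPN, the sketch below computes the Schatten-p quasi-norm of a matrix from its singular values. The kernelization and the full ADMM solver are omitted; this is only an illustration of the quantity being penalized, with an assumed function name and default p.
```python
import torch

def schatten_p(matrix, p=0.5, eps=1e-8):
    """(sum_i sigma_i^p)^(1/p) over singular values; a quasi-norm for p < 1."""
    sv = torch.linalg.svdvals(matrix)
    return sv.clamp_min(eps).pow(p).sum().pow(1.0 / p)

if __name__ == "__main__":
    low_rank = torch.randn(50, 3) @ torch.randn(3, 50)    # rank-3 matrix
    noisy = low_rank + 0.5 * torch.randn(50, 50)
    # The low-rank matrix yields a much smaller Schatten-p value than its noisy version.
    print(schatten_p(low_rank).item(), schatten_p(noisy).item())
```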
DrivingEditor: 4D Composite Gaussian Splatting for Reconstruction and Edition of Dynamic Autonomous Driving Scenes
IF 13.7 | Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3659733
Wang Xu;Yeqiang Qian;Yun-Fu Liu;Lei Tuo;Huiyong Chen;Ming Yang
In recent years, with the development of autonomous driving, 3D reconstruction of unbounded large-scale scenes has attracted researchers’ attention. Existing methods have achieved outstanding reconstruction accuracy in autonomous driving scenes, but most of them lack the ability to edit scenes. Although some methods have the capability to edit scenarios, they are highly dependent on manually annotated 3D bounding boxes, leading to poor scalability. To address these issues, we introduce a new Gaussian representation, called DrivingEditor, which decouples the scene into two parts and handles them with separate branches to individually model the dynamic foreground objects and the static background during the training process. By proposing a framework for decoupled modeling of scenarios, we can achieve accurate editing of any dynamic target, such as dynamic object removal and insertion, while improving the reconstruction quality of autonomous driving scenes, especially the dynamic foreground objects, without resorting to 3D bounding boxes. Extensive experiments on the Waymo Open Dataset and KITTI benchmarks demonstrate its 3D reconstruction performance on both dynamic and static scenes. In addition, we conduct extra experiments on unstructured large-scale scenarios, which further demonstrate the performance and robustness of our proposed model when rendering unstructured scenes. Our code is available at https://github.com/WangXu-xxx/DrivingEditor
Vol. 35, pp. 1696-1710 | Citations: 0
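The decoupled static/dynamic design described above can be pictured with a small data-structure sketch: a static point set plus per-object point sets with per-frame poses, composed on demand, so that editing amounts to adding or removing objects or poses. The layout and class names are assumptions, not DrivingEditor's representation (which stores full Gaussian attributes rather than bare points).
```python
from dataclasses import dataclass, field
import torch

@dataclass
class DynamicObject:
    points: torch.Tensor        # (N, 3) positions in the object's own frame
    poses: dict = field(default_factory=dict)   # frame_id -> (R: 3x3, t: (3,))

@dataclass
class CompositeScene:
    static_points: torch.Tensor                 # (M, 3) world-frame positions
    objects: list = field(default_factory=list)

    def compose(self, frame_id):
        """Assemble the full scene for one frame; edits = add/remove objects."""
        parts = [self.static_points]
        for obj in self.objects:
            if frame_id in obj.poses:
                R, t = obj.poses[frame_id]
                parts.append(obj.points @ R.T + t)   # object -> world transform
        return torch.cat(parts, dim=0)

if __name__ == "__main__":
    scene = CompositeScene(static_points=torch.randn(1000, 3))
    car = DynamicObject(points=torch.randn(200, 3),
                        poses={0: (torch.eye(3), torch.tensor([5.0, 0.0, 0.0]))})
    scene.objects.append(car)
    print(scene.compose(0).shape)   # torch.Size([1200, 3])
    scene.objects.pop()             # an "object removal" edit
    print(scene.compose(0).shape)   # torch.Size([1000, 3])
```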
Positional Encoding Image Prior
IF 13.7 | Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3653206
Nimrod Shabtay;Eli Schwartz;Raja Giryes
In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g. noisy) image, but in the process learns to reconstruct the clean image. This phenomenon is attributed to the CNN’s internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent with Fourier features (positional encoding). We empirically demonstrate that the convolution layers in DIP can be replaced with simple pixel-level MLPs thanks to the Fourier feature properties. We also prove that they are equivalent in the case of linear networks. We name our scheme “Positional Encoding Image Prior” (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks with far fewer parameters. Furthermore, we demonstrate that PIP can be easily extended to videos, an area where methods based on image priors and certain INR approaches face challenges with stability. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP
Vol. 35, pp. 2110-2121 | Citations: 0
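The core recipe summarized above (a Fourier-feature positional encoding fed to a pixel-level MLP) is standard enough to sketch. The snippet below maps pixel coordinates to RGB through random Fourier features; hyperparameters and the fitting loop are omitted, and details may differ from the project's actual code.
```python
import math
import torch
import torch.nn as nn

class FourierFeatureMLP(nn.Module):
    """Map (x, y) pixel coordinates to RGB via random Fourier features + MLP."""
    def __init__(self, num_freqs=128, scale=10.0, hidden=256):
        super().__init__()
        # Fixed random projection used as the positional encoding.
        self.register_buffer("B", torch.randn(2, num_freqs) * scale)
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, coords):                    # coords: (N, 2) in [0, 1]
        proj = 2 * math.pi * coords @ self.B      # (N, num_freqs)
        enc = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.mlp(enc)                      # (N, 3) RGB in [0, 1]

if __name__ == "__main__":
    h, w = 32, 32
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                            torch.linspace(0, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    model = FourierFeatureMLP()
    img = model(coords).reshape(h, w, 3)   # would be fitted to a degraded target
    print(img.shape)                       # torch.Size([32, 32, 3])
```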