
Latest publications: IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society)

Procedure-Aware Hierarchical Alignment for Open Surgery Video-Language Pretraining.
IF 13.7 Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3659752
Boqiang Xu, Jinlin Wu, Jian Liang, Zhenan Sun, Hongbin Liu, Jiebo Luo, Zhen Lei

Recent advances in surgical robotics and computer vision have greatly improved intelligent systems' autonomy and perception in the operating room (OR), especially in endoscopic and minimally invasive surgeries. However, for open surgery, which is still the predominant form of surgical intervention worldwide, there has been relatively limited exploration due to its inherent complexity and the lack of large-scale, diverse datasets. To close this gap, we present OpenSurgery, by far the largest video-text pretraining and evaluation dataset for open surgery understanding. OpenSurgery consists of two subsets: OpenSurgery-Pretrain and OpenSurgery-EVAL. OpenSurgery-Pretrain consists of 843 publicly available open surgery videos for pretraining, spanning 102 hours and encompassing over 20 distinct surgical types. OpenSurgery-EVAL is a benchmark dataset for evaluating model performance in open surgery understanding, comprising 280 training and 120 test videos, totaling 49 hours. Each video in OpenSurgery is meticulously annotated by expert surgeons at three hierarchical levels (video, operation, and frame) to ensure both high quality and strong clinical applicability. Next, we propose the Hierarchical Surgical Knowledge Pretraining (HierSKP) framework to facilitate large-scale multimodal representation learning for open surgery understanding. HierSKP leverages a granularity-aware contrastive learning strategy and enhances procedural comprehension by constructing hard negative samples and incorporating a Dynamic Time Warping (DTW)-based loss to capture fine-grained temporal alignment of visual semantics. Extensive experiments show that HierSKP achieves state-of-the-art performance on OpenSurgery-EVAL across multiple tasks, including operation recognition, temporal action localization, and zero-shot cross-modal retrieval. This demonstrates its strong generalizability for further advances in open surgery understanding.
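
The DTW-based alignment term can be illustrated with a short, self-contained sketch. This is not the authors' implementation (the function and variable names are hypothetical); it simply shows how a classic dynamic-time-warping cost over cosine distances between video-segment and text-step embeddings could be scored, which is the kind of fine-grained temporal alignment signal the abstract describes.

```python
# Minimal sketch, not the paper's loss: length-normalized DTW cost between
# two sequences of L2-normalized embeddings.
import numpy as np

def dtw_alignment_cost(video_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """video_emb: (T, D) video-segment embeddings; text_emb: (S, D) text-step embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - v @ t.T                      # cosine distance matrix, shape (T, S)
    T, S = cost.shape
    acc = np.full((T + 1, S + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):                 # standard DTW recursion
        for j in range(1, S + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[T, S] / (T + S))         # length-normalized alignment cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(dtw_alignment_cost(rng.normal(size=(16, 64)), rng.normal(size=(5, 64))))
```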

Citations: 0
Foundation Model Empowered Real-Time Video Conference with Semantic Communications.
IF 13.7 Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3659719
Mingkai Chen, Wenbo Ma, Mujian Zeng, Xiaoming He, Jian Xiong, Lei Wang, Anwer Al-Dulaimi, Shahid Mumtaz

With the development of real-time video conferences, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity is becoming one of the main features of future multimedia services, which brings a new challenge to Computer Vision (CV) for communications. In addition, individual CV directions for video, such as recognition, understanding, saliency segmentation, and coding, cannot satisfy the demands of these multiple interactive tasks without integration. Meanwhile, with the rapid development of foundation models, we apply task-oriented semantic communications to handle them. Therefore, we propose a novel framework, called Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the requirement of interactivity in multimedia services. Firstly, at the transmitter, we perform causal understanding and spatiotemporal decoupling on interactive videos with the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA), and Segment Anything Model 2 (SAM2) to accomplish video semantic segmentation. Secondly, during transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which also accommodates the weights of asymmetric semantic information in real-time video, so that we achieve a low bit rate and high semantic fidelity in video transmission. Thirdly, at the receiver, RTVCFM performs multidimensional fusion over the full semantic segmentation using the Diffusion Model for Foreground Background Fusion (DMFBF) and then reconstructs the video streams. Finally, the simulation results demonstrate that RTVCFM can achieve a compression ratio as high as 95.6% while guaranteeing high semantic similarity of 98.73% in Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in Structural Similarity (SSIM), which shows that the reconstructed video closely resembles the original video.
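
The fidelity metrics quoted above can be reproduced in principle with standard tooling. The sketch below, assuming scikit-image 0.19 or later, scores a reconstructed frame against the original with SSIM; MS-SSIM would repeat the same comparison over several scales. It is an illustration of the evaluation step only, not part of the RTVCFM pipeline.

```python
# Minimal sketch: SSIM between an original frame and its reconstruction.
import numpy as np
from skimage.metrics import structural_similarity

def frame_ssim(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Both inputs are uint8 RGB frames of identical shape (H, W, 3)."""
    return structural_similarity(original, reconstructed, channel_axis=-1, data_range=255)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
    noisy = np.clip(frame.astype(int) + rng.integers(-10, 10, frame.shape), 0, 255).astype(np.uint8)
    print(f"SSIM: {frame_ssim(frame, noisy):.4f}")
```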

Citations: 0
Anatomy-aware MR-imaging-only Radiotherapy.
IF 13.7 Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3658010
Hao Yang, Yue Sun, Hui Xie, Lina Zhao, Chi Kin Lam, Qiang Zhao, Xiangyu Xiong, Kunyan Cai, Behdad Dashtbozorg, Chenggang Yan, Tao Tan

The synthesis of computed tomography images can supplement electron density information and eliminate MR-CT image registration errors. Consequently, an increasing number of MR-to-CT image translation approaches are being proposed for MR-only radiotherapy planning. However, due to substantial anatomical differences between regions, traditional approaches often require a separate model to be developed and used for each region. In this paper, we propose a unified, prompt-driven model that dynamically adapts to different anatomical regions and generates CT images with high structural consistency. Specifically, it utilizes a region-specific attention mechanism, including a region-aware vector and a dynamic gating factor, to achieve MRI-to-CT image translation for multiple anatomical regions. Qualitative and quantitative results on three datasets of anatomical parts demonstrate that our models generate clearer and more anatomically detailed CT images than other state-of-the-art translation models. The results of the dosimetric analysis also indicate that our proposed model generates images with dose distributions more closely aligned to those of the real CT images. Thus, the proposed model demonstrates promising potential for enabling MR-only radiotherapy across multiple anatomical regions. We have released the source code for our RSAM model. The repository is accessible to the public at: https://github.com/yhyumi123/RSAM.
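
One plausible reading of the region-aware vector and dynamic gating factor is a learned per-region embedding that gates shared backbone features. The PyTorch sketch below is purely illustrative; the module and tensor names are assumptions, not the released RSAM code.

```python
# Illustrative sketch only: conditioning shared features on an anatomical-region
# prompt via a learned region vector and a sigmoid gate.
import torch
import torch.nn as nn

class RegionPromptGate(nn.Module):
    def __init__(self, num_regions: int, channels: int):
        super().__init__()
        self.region_embed = nn.Embedding(num_regions, channels)   # region-aware vector
        self.gate = nn.Linear(channels, channels)                 # dynamic gating factor

    def forward(self, feats: torch.Tensor, region_id: torch.Tensor) -> torch.Tensor:
        """feats: (B, C, H, W); region_id: (B,) integer anatomical-region labels."""
        r = self.region_embed(region_id)                              # (B, C)
        g = torch.sigmoid(self.gate(r)).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return feats * g + r.unsqueeze(-1).unsqueeze(-1)              # gated, region-conditioned features

if __name__ == "__main__":
    m = RegionPromptGate(num_regions=3, channels=16)
    out = m(torch.randn(2, 16, 8, 8), torch.tensor([0, 2]))
    print(out.shape)  # torch.Size([2, 16, 8, 8])
```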

Citations: 0
Double Nonconvex Tensor Robust Kernel Principal Component Analysis and Its Visual Applications.
IF 13.7 Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3659302
Liang Wu, Jianjun Wang, Wei-Shi Zheng, Guangming Shi

Tensor robust principal component analysis (TRPCA), as a popular linear low-rank method, has been widely applied to various visual tasks. The mathematical formulation of its low-rank prior is derived from the linear latent variable model. However, for nonlinear tensor data with rich information, the nonlinear structures may break the low-rankness assumption and lead to large approximation errors for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, this paper first establishes the general paradigm of the nonlinear tensor plus sparse tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA). To efficiently tackle the TRKPCA problem, two novel nonconvex regularizers, the kernelized tensor Schatten-p norm (KTSPN) and a generalized nonconvex regularization, are designed, where the former, with tighter theoretical support, adequately captures nonlinear features (i.e., implicit low-rankness), and the latter ensures sparser structural coding, guaranteeing more robust separation results. Then, by integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method. Finally, we develop an efficient optimization framework via the alternating direction method of multipliers (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show the higher competitiveness of our method compared with other state-of-the-art regularization methods. The code has been released on our ResearchGate homepage: https://www.researchgate.net/publication/397181729 DNTRKPCA code.
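
For readers unfamiliar with the Schatten-p norm, the matrix version that KTSPN generalizes is easy to compute from singular values; with p < 1 the penalty is nonconvex and approximates the rank function more tightly than the nuclear norm. The NumPy sketch below is a minimal illustration, not the paper's tensor or kernelized formulation.

```python
# Minimal sketch: matrix Schatten-p penalty evaluated on singular values.
import numpy as np

def schatten_p_penalty(X: np.ndarray, p: float = 0.5) -> float:
    """sum_i sigma_i(X)^p, reported without the 1/p root; p=1 recovers the nuclear norm."""
    sigma = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(sigma ** p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 50))   # rank-3 matrix
    full_rank = rng.normal(size=(50, 50))
    print(schatten_p_penalty(low_rank, 0.5), schatten_p_penalty(full_rank, 0.5))
```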

Citations: 0
DrivingEditor: 4D Composite Gaussian Splatting for Reconstruction and Edition of Dynamic Autonomous Driving Scenes.
IF 13.7 Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3659733
Wang Xu, Yeqiang Qian, Yun-Fu Liu, Lei Tuo, Huiyong Chen, Ming Yang

In recent years, with the development of autonomous driving, 3D reconstruction of unbounded large-scale scenes has attracted researchers' attention. Existing methods have achieved outstanding reconstruction accuracy in autonomous driving scenes, but most of them lack the ability to edit scenes. Although some methods are capable of editing scenarios, they depend heavily on manually annotated 3D bounding boxes, leading to poor scalability. To address these issues, we introduce a new Gaussian representation, called DrivingEditor, which decouples the scene into two parts and handles them with separate branches that individually model the dynamic foreground objects and the static background during training. By proposing a framework for decoupled modeling of scenarios, we achieve accurate editing of any dynamic target, such as removing or adding dynamic objects, while improving the reconstruction quality of autonomous driving scenes, especially for dynamic foreground objects, without resorting to 3D bounding boxes. Extensive experiments on the Waymo Open Dataset and KITTI benchmarks demonstrate its 3D reconstruction performance for both dynamic and static scenes. In addition, we conduct extra experiments on unstructured large-scale scenarios, which more convincingly demonstrate the performance and robustness of our proposed model when rendering unstructured scenes. Our code is available at https://github.com/WangXu-xxx/DrivingEditor.
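
A rough way to picture the decoupled representation is a static set of background Gaussians plus per-object foreground sets that receive per-frame rigid poses, with editing amounting to dropping or re-posing an object's set before rendering. The sketch below is conceptual only; the data layout and names are hypothetical, only Gaussian centers are shown, and a real renderer also carries covariances, opacities, and colors.

```python
# Conceptual sketch: compose static background Gaussians with posed dynamic
# objects per frame; "editing" removes an object's Gaussian set before rendering.
import numpy as np

def pose_object(centers: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Apply a per-frame rigid transform to one object's Gaussian centers (N, 3)."""
    return centers @ R.T + t

def compose_frame(static_centers, objects, poses, removed=frozenset()):
    """objects: {obj_id: (N_i, 3)}; poses: {obj_id: (R, t)} for this frame."""
    parts = [static_centers]
    for obj_id, centers in objects.items():
        if obj_id in removed:          # edit: delete a dynamic target
            continue
        R, t = poses[obj_id]
        parts.append(pose_object(centers, R, t))
    return np.concatenate(parts, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = rng.normal(size=(1000, 3))
    cars = {"car_0": rng.normal(size=(200, 3)), "car_1": rng.normal(size=(150, 3))}
    poses = {k: (np.eye(3), np.array([5.0, 0.0, 0.0])) for k in cars}
    full = compose_frame(background, cars, poses)
    edited = compose_frame(background, cars, poses, removed={"car_1"})
    print(full.shape, edited.shape)    # (1350, 3) (1200, 3)
```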

Citations: 0
Positional Encoding Image Prior.
IF 13.7 Pub Date : 2026-02-06 DOI: 10.1109/TIP.2026.3653206
Nimrod Shabtay, Eli Schwartz, Raja Giryes

In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g., noisy) image, but in the process it learns to reconstruct the clean image. This phenomenon is attributed to the CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent with Fourier features (positional encoding). We empirically demonstrate that the convolution layers in DIP can be replaced with simple pixel-level MLPs thanks to the properties of Fourier features. We also prove that the two are equivalent in the case of linear networks. We name our scheme "Positional Encoding Image Prior" (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks with far fewer parameters. Furthermore, we demonstrate that PIP can be easily extended to videos, an area where methods based on image priors and certain INR approaches face challenges with stability. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP.
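
The core idea of replacing DIP's random latent and CNN with a Fourier-feature positional encoding and a pixel-wise MLP can be sketched compactly. The PyTorch code below is a minimal illustration in the spirit of PIP; the hyperparameters and training loop are illustrative, not the project's released code.

```python
# Minimal sketch: fixed Fourier-feature encoding of pixel coordinates fed to a
# small pixel-wise MLP, fitted to a degraded image as in DIP.
import math
import torch
import torch.nn as nn

def fourier_features(h: int, w: int, num_freqs: int = 16) -> torch.Tensor:
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)                       # (H*W, 2)
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * math.pi     # octave frequencies
    angles = coords[:, None, :] * freqs[None, :, None]                          # (H*W, F, 2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).reshape(h * w, -1)

class PixelMLP(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        return self.net(enc)                                                    # (H*W, 3)

if __name__ == "__main__":
    h, w = 64, 64
    noisy = torch.rand(h * w, 3)                       # stand-in for a degraded image
    enc = fourier_features(h, w)
    model = PixelMLP(enc.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):                               # early stopping is the usual DIP-style trick
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(enc), noisy)
        loss.backward()
        opt.step()
    print(float(loss))
```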

Citations: 0
SigMa: Semantic Similarity-Guided Semi-Dense Feature Matching
IF 13.7 Pub Date : 2026-01-21 DOI: 10.1109/TIP.2026.3654367
Xiang Fang;Zizhuo Li;Jiayi Ma
Recent advancements have led the image matching community to increasingly focus on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to focus excessively on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method, which leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two kinds of features and effectively integrate them, we also introduce a cross-domain feature adapter, which overcomes their spatial resolution mismatches, channel dimensionality variations, and inter-domain gaps. Furthermore, we observe that applying the transformer to the whole feature map is unnecessary because of the similarity of local representations, and we therefore design a guided pooling method based on semantic similarity. This strategy performs attention computation by selecting highly semantically similar regions, aiming to minimize information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization capabilities across different datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method's design. Our code is publicly available at https://github.com/ShineFox/SigMa
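
The semantic-similarity-guided selection step can be pictured as computing cosine similarity between high-level semantic descriptors of candidate regions and keeping only the top-k most similar ones before attention. The sketch below is a toy illustration, not the released SigMa code.

```python
# Toy sketch: keep only the top-k semantically similar reference regions per
# query region, so attention never runs over the full dense feature map.
import torch
import torch.nn.functional as F

def select_similar_regions(query_sem: torch.Tensor, ref_sem: torch.Tensor, k: int = 4):
    """query_sem: (Nq, D) semantic descriptors of query regions; ref_sem: (Nr, D)."""
    sim = F.normalize(query_sem, dim=-1) @ F.normalize(ref_sem, dim=-1).T   # (Nq, Nr) cosine similarities
    topk = sim.topk(k, dim=-1)
    return topk.indices, topk.values      # per query region: indices and scores of its k best matches

if __name__ == "__main__":
    torch.manual_seed(0)
    idx, val = select_similar_regions(torch.randn(10, 32), torch.randn(20, 32), k=4)
    print(idx.shape, val.shape)           # torch.Size([10, 4]) torch.Size([10, 4])
```
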
Citations: 0
Reliable Pseudo-Supervision for Unsupervised Domain Adaptive Person Search
IF 13.7 Pub Date : 2026-01-21 DOI: 10.1109/TIP.2026.3654373
Qixian Zhang;Duoqian Miao;Qi Zhang;Xuan Tan;Hongyun Zhang;Cairong Zhao
Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online–offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. Our code is accessible at https://github.com/zqx951102/RPPS
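
The frequency split at the heart of DWEM is a standard one-level 2-D discrete wavelet transform. The sketch below, assuming PyWavelets is installed, shows the decomposition on a single-channel array; in the paper the module operates on backbone feature maps, and the two branches then enhance the resulting bands differently.

```python
# Minimal sketch: one-level 2-D DWT splitting a map into a low-frequency
# approximation band and three high-frequency detail bands.
import numpy as np
import pywt

def split_frequencies(feat_map: np.ndarray):
    """feat_map: (H, W) array. Returns the low-frequency band and stacked high-frequency bands."""
    low, (h_detail, v_detail, d_detail) = pywt.dwt2(feat_map, "haar")
    high = np.stack([h_detail, v_detail, d_detail], axis=0)   # (3, H/2, W/2)
    return low, high

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low, high = split_frequencies(rng.normal(size=(64, 64)))
    print(low.shape, high.shape)   # (32, 32) (3, 32, 32)
```
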
Citations: 0
Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts
IF 13.7 Pub Date : 2026-01-21 DOI: 10.1109/TIP.2026.3654473
Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone's feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
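
A Mixture of LoRA Experts can be pictured as a frozen linear layer augmented with several low-rank adapters whose outputs are mixed by an input-conditioned gate. The PyTorch sketch below is a simplified illustration; the dimensions, gating, and initialization are assumptions, not the paper's exact design.

```python
# Simplified sketch: frozen base linear layer plus softmax-gated LoRA experts.
import torch
import torch.nn as nn

class MoLELinear(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)                  # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.down = nn.Parameter(torch.randn(num_experts, dim, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, dim))   # zero-init: training starts at the base model
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, dim)."""
        w = torch.softmax(self.gate(x), dim=-1)                 # (B, E) expert mixing weights
        delta = torch.einsum("bd,edr,erk->bek", x, self.down, self.up)   # (B, E, dim) per-expert LoRA outputs
        return self.base(x) + (w.unsqueeze(-1) * delta).sum(dim=1)

if __name__ == "__main__":
    m = MoLELinear(dim=64)
    print(m(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```
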
Citations: 0
Imbalanced Multiclassification Challenges in Whole Slide Image: Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning With Dynamic Rebalancing
IF 13.7 Pub Date : 2026-01-21 DOI: 10.1109/TIP.2026.3654402
Yonghuang Wu;Xuan Xie;Chengqian Zhao;Pengfei Song;Feiyu Yin;Guoqing Wu;Jinhua Yu
The multi-classification of histopathological images under imbalanced sample conditions remains a long-standing unresolved challenge in computational pathology. In this paper, we propose for the first time a cross-patient pseudo-bag generation technique to address this challenge. Our key innovation lies in a cross-patient pseudo-bag generation framework that extracts complementary pathological features to construct distributionally consistent pseudo-bags. To resolve the critical challenge of distributional alignment in pseudo-bag generation, we propose an affinity-driven curriculum contrastive learning strategy, integrating sample affinity metrics with progressive training to stabilize representation learning. Unlike prior methods focused on bag-level embeddings, our framework pioneers a paradigm shift toward multi-instance feature distribution mining, explicitly modeling inter-bag heterogeneity to address class imbalance. Our method demonstrates significant performance improvements on three datasets with multiple classification difficulties, outperforming the second-best method by an average of 1.95 percentage points in F1 score and 2.07 percentage points in ACC.
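
The basic cross-patient mixing idea can be sketched as sampling patch features across patients that share a slide-level label to form fixed-size, class-balanced pseudo-bags. The NumPy sketch below is conceptual only; the paper's distributionally consistent construction and affinity-driven curriculum are more involved, and the function names are hypothetical.

```python
# Conceptual sketch: build class-balanced pseudo-bags by sampling patch
# features across different patients with the same slide-level label.
import numpy as np

def make_pseudo_bags(patient_feats, patient_labels, bag_size=64, bags_per_class=10, seed=0):
    """patient_feats: list of (N_i, D) patch-feature arrays; patient_labels: list of int labels."""
    rng = np.random.default_rng(seed)
    bags, labels = [], []
    for cls in sorted(set(patient_labels)):
        # pool instances from every patient of this class, then sample pseudo-bags
        pool = np.concatenate([f for f, y in zip(patient_feats, patient_labels) if y == cls], axis=0)
        for _ in range(bags_per_class):
            idx = rng.choice(len(pool), size=bag_size, replace=len(pool) < bag_size)
            bags.append(pool[idx])
            labels.append(cls)
    return bags, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    feats = [rng.normal(size=(rng.integers(80, 200), 128)) for _ in range(6)]
    labs = [0, 0, 1, 1, 2, 2]
    bags, labels = make_pseudo_bags(feats, labs)
    print(len(bags), bags[0].shape, labels[:5])   # 30 (64, 128) [0, 0, 0, 0, 0]
```
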
Citations: 0