Recent advances in surgical robotics and computer vision have greatly improved the autonomy and perception of intelligent systems in the operating room (OR), especially in endoscopic and minimally invasive surgeries. However, open surgery, which remains the predominant form of surgical intervention worldwide, has seen relatively limited exploration due to its inherent complexity and the lack of large-scale, diverse datasets. To close this gap, we present OpenSurgery, the largest video-text pretraining and evaluation dataset for open surgery understanding to date. OpenSurgery consists of two subsets: OpenSurgery-Pretrain and OpenSurgery-EVAL. OpenSurgery-Pretrain comprises 843 publicly available open surgery videos for pretraining, spanning 102 hours and encompassing over 20 distinct surgical types. OpenSurgery-EVAL is a benchmark for evaluating model performance in open surgery understanding, comprising 280 training and 120 test videos totaling 49 hours. Each video in OpenSurgery is meticulously annotated by expert surgeons at three hierarchical levels (video, operation, and frame) to ensure both high quality and strong clinical applicability. We then propose the Hierarchical Surgical Knowledge Pretraining (HierSKP) framework to facilitate large-scale multimodal representation learning for open surgery understanding. HierSKP leverages a granularity-aware contrastive learning strategy and enhances procedural comprehension by constructing hard negative samples and incorporating a Dynamic Time Warping (DTW)-based loss to capture fine-grained temporal alignment of visual semantics. Extensive experiments show that HierSKP achieves state-of-the-art performance on OpenSurgery-EVAL across multiple tasks, including operation recognition, temporal action localization, and zero-shot cross-modal retrieval, demonstrating its strong generalizability for further advances in open surgery understanding.
Title: Procedure-Aware Hierarchical Alignment for Open Surgery Video-Language Pretraining
Authors: Boqiang Xu, Jinlin Wu, Jian Liang, Zhenan Sun, Hongbin Liu, Jiebo Luo, Zhen Lei
Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3659752
Journal: IEEE Transactions on Image Processing
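The abstract names a DTW-based loss for fine-grained temporal alignment but does not give its form. As background, here is a minimal NumPy sketch of the classic DTW alignment cost between two feature sequences (the loss used in HierSKP is presumably a differentiable variant such as soft-DTW; this sketch is only illustrative):

```python
import numpy as np

def dtw_cost(x, y):
    """Classic DTW alignment cost between two feature sequences.

    x: (T1, D) array, y: (T2, D) array. Returns the minimal cumulative
    pairwise-distance cost over all monotonic alignments.
    """
    T1, T2 = len(x), len(y)
    # Pairwise Euclidean distances between every frame pair.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[T1, T2]

# Identical sequences align with zero cost.
a = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
print(dtw_cost(a, a))  # → 0.0
```

Minimizing such a cost between frame embeddings and step-level text embeddings encourages the monotonic temporal correspondence the abstract describes.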
With the development of real-time video conferencing, interactive multimedia services have proliferated, leading to a surge in traffic. Interactivity is becoming one of the main features of future multimedia services, which poses a new challenge for Computer Vision (CV) in communications. Moreover, individual CV directions for video, such as recognition, understanding, saliency segmentation, and coding, cannot satisfy the multiple demands of interactivity without being integrated. Meanwhile, building on the rapid development of foundation models, we apply task-oriented semantic communications to address these demands. We therefore propose a novel framework, called Real-Time Video Conference with Foundation Model (RTVCFM), to satisfy the interactivity requirements of multimedia services. First, at the transmitter, we perform causal understanding and spatiotemporal decoupling of interactive videos with the Video Time-Aware Large Language Model (VTimeLLM), Iterated Integrated Attributions (IIA), and Segment Anything Model 2 (SAM2) to accomplish video semantic segmentation. Second, during transmission, we propose a two-stage semantic transmission optimization driven by Channel State Information (CSI), which also accommodates the asymmetric weighting of semantic information in real-time video, achieving a low bit rate and high semantic fidelity. Third, at the receiver, RTVCFM performs multidimensional fusion of the semantic segments using the Diffusion Model for Foreground Background Fusion (DMFBF), and then we reconstruct the video streams. Finally, simulation results demonstrate that RTVCFM achieves a compression ratio as high as 95.6% while guaranteeing high semantic similarity of 98.73% in Multi-Scale Structural Similarity Index Measure (MS-SSIM) and 98.35% in Structural Similarity (SSIM), showing that the reconstructed video closely matches the original.
Title: Foundation Model Empowered Real-Time Video Conference with Semantic Communications
Authors: Mingkai Chen, Wenbo Ma, Mujian Zeng, Xiaoming He, Jian Xiong, Lei Wang, Anwer Al-Dulaimi, Shahid Mumtaz
Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3659719
Journal: IEEE Transactions on Image Processing
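The abstract reports reconstruction quality in SSIM and MS-SSIM. For reference, this is a simplified single-window SSIM (the standard metric uses an 11x11 Gaussian sliding window and averages local scores; MS-SSIM further combines several scales):

```python
import numpy as np

def ssim_global(x, y, L=1.0, k1=0.01, k2=0.03):
    """SSIM computed from global image statistics (no sliding window),
    a simplification of the standard formula. L is the dynamic range."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

img = np.random.rand(32, 32)
print(round(ssim_global(img, img), 6))  # → 1.0
```

An identical pair scores exactly 1; distortions reduce the luminance, contrast, or structure terms and pull the score below 1.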
Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3658010
Hao Yang, Yue Sun, Hui Xie, Lina Zhao, Chi Kin Lam, Qiang Zhao, Xiangyu Xiong, Kunyan Cai, Behdad Dashtbozorg, Chenggang Yan, Tao Tan
The synthesis of computed tomography (CT) images can supplement electron density information and eliminate MR-CT image registration errors. Consequently, an increasing number of MR-to-CT image translation approaches are being proposed for MR-only radiotherapy planning. However, due to substantial anatomical differences between regions, traditional approaches often require a separate model to be developed and deployed for each region. In this paper, we propose a unified, prompt-driven model that dynamically adapts to different anatomical regions and generates CT images with high structural consistency. Specifically, it utilizes a region-specific attention mechanism, including a region-aware vector and a dynamic gating factor, to achieve MRI-to-CT image translation for multiple anatomical regions. Qualitative and quantitative results on three anatomical-region datasets demonstrate that our model generates clearer and more anatomically detailed CT images than other state-of-the-art translation models. Dosimetric analysis also indicates that our model generates images with dose distributions more closely aligned with those of real CT images. Thus, the proposed model demonstrates promising potential for enabling MR-only radiotherapy across multiple anatomical regions. We have released the source code for our RSAM model; the repository is publicly accessible at: https://github.com/yhyumi123/RSAM.
Title: Anatomy-aware MR-imaging-only Radiotherapy
Journal: IEEE Transactions on Image Processing
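The abstract mentions a region-aware vector and a dynamic gating factor without specifying their form. A toy sketch of how such gating could modulate shared features per anatomical region (all names and the blending rule here are hypothetical, not taken from the RSAM code):

```python
import numpy as np

def gated_region_attention(feat, region_vec, gate):
    """Blend a shared feature with a region-specific modulation:
    out = feat + gate * (region_vec * feat), where gate in [0, 1] is a
    dynamic gating factor and region_vec a per-channel region scale."""
    return feat + gate * (region_vec * feat)

feat = np.ones((4,))                       # toy channel features
region_vec = np.array([0.5, -0.5, 1.0, 0.0])
print(gated_region_attention(feat, region_vec, gate=0.0))  # → [1. 1. 1. 1.]
```

With the gate closed (0.0) the shared pathway passes through unchanged; opening it mixes in the region-specific scaling.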
Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3659302
Liang Wu, Jianjun Wang, Wei-Shi Zheng, Guangming Shi
Tensor robust principal component analysis (TRPCA), a popular linear low-rank method, has been widely applied to various visual tasks. The mathematical formulation of its low-rank prior derives from the linear latent variable model. However, for nonlinear tensor data with rich information, the nonlinear structures may break the low-rankness assumption and lead to large approximation errors for TRPCA. Motivated by the latent low-dimensionality of nonlinear tensors, this paper first establishes the general paradigm of the nonlinear-tensor-plus-sparse-tensor decomposition problem, called tensor robust kernel principal component analysis (TRKPCA). To tackle the TRKPCA problem efficiently, two novel nonconvex regularizers are designed: the kernelized tensor Schatten-p norm (KTSPN) and a generalized nonconvex regularization. The former, with tighter theoretical support, adequately captures nonlinear features (i.e., implicit low-rankness), while the latter ensures sparser structural coding, guaranteeing more robust separation results. By integrating their strengths, we propose a double nonconvex TRKPCA (DNTRKPCA) method. Finally, we develop an efficient optimization framework via the alternating direction method of multipliers (ADMM) to implement the proposed nonconvex kernel method. Experimental results on synthetic data and several real databases show that our method is more competitive than other state-of-the-art regularization methods. The code has been released on our ResearchGate homepage: https://www.researchgate.net/publication/397181729 DNTRKPCA code.
Title: Double Nonconvex Tensor Robust Kernel Principal Component Analysis and Its Visual Applications
Journal: IEEE Transactions on Image Processing
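For readers unfamiliar with the Schatten-p (quasi-)norm the KTSPN builds on, here is its matrix form computed via SVD (the paper's kernelized tensor version is more involved; this sketch only illustrates why p < 1 gives a tighter rank surrogate than the nuclear norm):

```python
import numpy as np

def schatten_p(M, p=0.5):
    """Schatten-p quasi-norm of a matrix: (sum_i sigma_i^p)^(1/p).
    p = 1 recovers the nuclear norm; p < 1 is nonconvex and closer
    to counting nonzero singular values (the rank)."""
    s = np.linalg.svd(M, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

# A rank-1 matrix has one nonzero singular value, so the Schatten-p
# value equals that singular value for any p.
u = np.array([[1.0], [2.0]])
M = u @ u.T                            # singular values: [5, 0]
print(round(schatten_p(M, p=0.5), 3))  # → 5.0
```

For higher-rank matrices, shrinking p below 1 penalizes small singular values relatively more, promoting genuinely low-rank solutions.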
Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3659733
Wang Xu, Yeqiang Qian, Yun-Fu Liu, Lei Tuo, Huiyong Chen, Ming Yang
In recent years, with the development of autonomous driving, 3D reconstruction of unbounded large-scale scenes has attracted researchers' attention. Existing methods have achieved outstanding reconstruction accuracy in autonomous driving scenes, but most of them cannot edit scenes. Although some methods can edit scenarios, they depend heavily on manually annotated 3D bounding boxes, which limits their scalability. To address these issues, we introduce a new Gaussian representation, called DrivingEditor, which decouples the scene into two parts handled by separate branches, individually modeling the dynamic foreground objects and the static background during training. With this decoupled modeling framework, we can accurately edit any dynamic target (e.g., removing or adding dynamic objects) while improving the reconstruction quality of autonomous driving scenes, especially the dynamic foreground objects, without resorting to 3D bounding boxes. Extensive experiments on the Waymo Open Dataset and KITTI benchmarks demonstrate its 3D reconstruction performance on both dynamic and static scenes. In addition, we conduct extra experiments on unstructured large-scale scenarios, which further demonstrate the performance and robustness of our proposed model when rendering unstructured scenes. Our code is available at https://github.com/WangXu-xxx/DrivingEditor.
Title: DrivingEditor: 4D Composite Gaussian Splatting for Reconstruction and Edition of Dynamic Autonomous Driving Scenes
Journal: IEEE Transactions on Image Processing
Pub Date: 2026-02-06 DOI: 10.1109/TIP.2026.3653206
Nimrod Shabtay, Eli Schwartz, Raja Giryes
In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g., noisy) image, but in the process it learns to reconstruct the clean image. This phenomenon is attributed to the CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent input with Fourier features (positional encoding). We empirically demonstrate that, thanks to the properties of Fourier features, the convolution layers in DIP can be replaced with simple pixel-level MLPs, and we prove that the two are equivalent in the case of linear networks. We name our scheme "Positional Encoding Image Prior" (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks with far fewer parameters. Furthermore, we demonstrate that PIP extends easily to videos, an area where image-prior-based methods and certain INR approaches face stability challenges. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP.
Title: Positional Encoding Image Prior
Journal: IEEE Transactions on Image Processing
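The Fourier-feature (positional encoding) input that PIP substitutes for DIP's random latent can be sketched as follows; the exact frequency schedule and count in the paper may differ (this uses the common geometric spacing 2^k * pi):

```python
import numpy as np

def fourier_features(coords, n_freqs=4):
    """Map 2-D pixel coordinates in [0, 1] to sin/cos features at
    geometrically spaced frequencies (positional encoding)."""
    feats = []
    for k in range(n_freqs):
        for fn in (np.sin, np.cos):
            feats.append(fn((2.0 ** k) * np.pi * coords))
    return np.concatenate(feats, axis=-1)

# Build a coordinate grid for an 8x8 image and encode it.
h = w = 8
ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
coords = np.stack([ys, xs], axis=-1).reshape(-1, 2)   # (64, 2)
pe = fourier_features(coords)                          # (64, 16)
print(pe.shape)  # → (64, 16)
```

Each pixel's encoding would then be fed through a shared pixel-level MLP, which is the convolution-free variant the abstract describes.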
Pub Date: 2026-01-21 DOI: 10.1109/TIP.2026.3654367
Xiang Fang, Zizhuo Li, Jiayi Ma
Recent advancements have led the image matching community to focus increasingly on obtaining subpixel-level correspondences in a detector-free manner, i.e., semi-dense feature matching. Existing methods tend to focus excessively on low-level local features while ignoring equally important high-level semantic information. To tackle these shortcomings, we propose SigMa, a semantic similarity-guided semi-dense feature matching method that leverages the strengths of both local features and high-level semantic features. First, we design a dual-branch feature extractor, comprising a convolutional network and a vision foundation model, to extract low-level local features and high-level semantic features, respectively. To fully retain the advantages of these two kinds of features and integrate them effectively, we also introduce a cross-domain feature adapter, which overcomes their spatial resolution mismatches, channel dimensionality variations, and inter-domain gaps. Furthermore, we observe that applying the transformer to the whole feature map is unnecessary because of the similarity of local representations, so we design a guided pooling method based on semantic similarity. This strategy computes attention over regions selected for high semantic similarity, minimizing information loss while maintaining computational efficiency. Extensive experiments on multiple datasets demonstrate that our method achieves a competitive accuracy-efficiency trade-off across various tasks and exhibits strong generalization across datasets. Additionally, we conduct a series of ablation studies and analysis experiments to validate the effectiveness and rationality of our method's design. Our code is publicly available at https://github.com/ShineFox/SigMa
Title: SigMa: Semantic Similarity-Guided Semi-Dense Feature Matching
Journal: IEEE Transactions on Image Processing, vol. 35, pp. 872-887
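The semantic-similarity-guided selection behind SigMa's pooling can be illustrated with a toy top-k filter: restrict attention to the tokens most similar to a query descriptor rather than the full feature map. This sketch is a generic illustration, not the paper's actual pooling operator:

```python
import numpy as np

def select_similar_regions(tokens, query, k):
    """Pick the k tokens most cosine-similar to a query descriptor,
    so attention is computed only over those regions."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = t @ q                      # cosine similarity per token
    idx = np.argsort(-sims)[:k]       # indices of the top-k tokens
    return tokens[idx], idx

tokens = np.random.randn(100, 32)     # 100 feature-map tokens, dim 32
query = tokens[7]                     # the query is itself among the tokens
sel, idx = select_similar_regions(tokens, query, k=10)
print(idx[0])  # → 7
```

Attention over 10 tokens instead of 100 cuts the quadratic cost by roughly two orders of magnitude while keeping the most semantically relevant regions.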
Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online–offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. Our code is accessible at https://github.com/zqx951102/RPPS
{"title":"Reliable Pseudo-Supervision for Unsupervised Domain Adaptive Person Search","authors":"Qixian Zhang;Duoqian Miao;Qi Zhang;Xuan Tan;Hongyun Zhang;Cairong Zhao","doi":"10.1109/TIP.2026.3654373","DOIUrl":"10.1109/TIP.2026.3654373","url":null,"abstract":"Unsupervised Domain Adaptation (UDA) person search aims to adapt models trained on labeled source data to unlabeled target domains. Existing approaches typically rely on clustering-based proxy learning, but their performance is often undermined by unreliable pseudo-supervision. This unreliability mainly stems from two challenges: (i) spectral shift bias, where low- and high-frequency components behave differently under domain shifts but are rarely considered, degrading feature stability; and (ii) static proxy updates, which make clustering proxies highly sensitive to noise and less adaptable to domain shifts. To address these challenges, we propose the Reliable Pseudo-supervision in UDA Person Search (RPPS) framework. At the feature level, a Dual-branch Wavelet Enhancement Module (DWEM) embedded in the backbone applies discrete wavelet transform (DWT) to decompose features into low- and high-frequency components, followed by differentiated enhancements that improve cross-domain robustness and discriminability. At the proxy level, a Dynamic Confidence-weighted Clustering Proxy (DCCP) employs confidence-guided initialization and a two-stage online–offline update strategy to stabilize proxy optimization and suppress proxy noise. Extensive experiments on the CUHK-SYSU and PRW benchmarks demonstrate that RPPS achieves state-of-the-art performance and strong robustness, underscoring the importance of enhancing pseudo-supervision reliability in UDA person search. 
Our code is accessible at <uri>https://github.com/zqx951102/RPPS</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"915-929"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
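The DWEM described in this abstract decomposes backbone features into low- and high-frequency bands with a discrete wavelet transform and then enhances each band differently. As a minimal, hypothetical sketch of that idea only (a single-level 1-D Haar DWT with simple per-band gains; the function names and the scalar "enhancement" are illustrative assumptions, not the authors' implementation):

```python
import math

def haar_dwt(x):
    """Single-level 1-D Haar DWT: split a feature vector into
    low-frequency (approximation) and high-frequency (detail) parts."""
    s = 1 / math.sqrt(2)
    low = [(a + b) * s for a, b in zip(x[0::2], x[1::2])]
    high = [(a - b) * s for a, b in zip(x[0::2], x[1::2])]
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: reconstruct the feature vector from its bands."""
    s = 1 / math.sqrt(2)
    x = []
    for l, h in zip(low, high):
        x.extend([(l + h) * s, (l - h) * s])
    return x

def dual_branch_enhance(x, low_gain=1.0, high_gain=0.5):
    """Toy DWEM-style step: process the two bands differently (here just
    scaling; the real module applies learned enhancements) and reconstruct."""
    low, high = haar_dwt(x)
    low = [v * low_gain for v in low]
    high = [v * high_gain for v in high]
    return haar_idwt(low, high)
```

With both gains at 1.0 the transform reconstructs the input exactly; damping the high band smooths each pair toward its mean, which is the kind of band-specific treatment the abstract alludes to.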
Pub Date: 2026-01-21. DOI: 10.1109/TIP.2026.3654473
Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone's feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
{"title":"Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts","authors":"Zhong Ji;Rongshuai Wei;Jingren Liu;Yanwei Pang;Jungong Han","doi":"10.1109/TIP.2026.3654473","DOIUrl":"10.1109/TIP.2026.3654473","url":null,"abstract":"Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone’s feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%–8.7% relative gains in 5-way 5-shot classification. 
These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"930-942"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
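The MoLE component in this abstract routes an input through several low-rank (LoRA) adapters and mixes their updates with a gate on top of a frozen weight. A toy, self-contained sketch under that reading (dense Python lists instead of tensors; `mole_forward` and its gating are illustrative assumptions, not the paper's actual code):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of gate logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(M, x):
    """Matrix-vector product on plain nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def mole_forward(x, W, experts, gate_logits):
    """Toy Mixture-of-LoRA-Experts layer: the frozen weight W is adapted
    by a gate-weighted sum of low-rank updates B_i @ A_i (rank r << dim).
    `experts` is a list of (A, B) pairs; gate_logits come from a router."""
    g = softmax(gate_logits)
    y = matvec(W, x)  # frozen backbone path
    for w, (A, B) in zip(g, experts):
        delta = matvec(B, matvec(A, x))  # low-rank update for this expert
        y = [yi + w * di for yi, di in zip(y, delta)]
    return y
```

Only the A/B factors and the router are trainable here, which is what makes the adaptation parameter-efficient relative to fine-tuning W itself.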
The multi-classification of histopathological images under imbalanced sample conditions remains a long-standing unresolved challenge in computational pathology. In this paper, we propose, for the first time, a cross-patient pseudo-bag generation technique to address this challenge: our key innovation is a framework that extracts complementary pathological features across patients to construct distributionally consistent pseudo-bags. To resolve the critical challenge of distributional alignment in pseudo-bag generation, we propose an affinity-driven curriculum contrastive learning strategy that integrates sample affinity metrics with progressive training to stabilize representation learning. Unlike prior methods focused on bag-level embeddings, our framework pioneers a paradigm shift toward multi-instance feature distribution mining, explicitly modeling inter-bag heterogeneity to address class imbalance.
{"title":"Imbalanced Multiclassification Challenges in Whole Slide Image: Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning With Dynamic Rebalancing","authors":"Yonghuang Wu;Xuan Xie;Chengqian Zhao;Pengfei Song;Feiyu Yin;Guoqing Wu;Jinhua Yu","doi":"10.1109/TIP.2026.3654402","DOIUrl":"10.1109/TIP.2026.3654402","url":null,"abstract":"The multi-classification of histopathological images under imbalanced sample conditions remains a long-standing unresolved challenge in computational pathology. In this paper, we propose for the first time a cross-patient pseudo-bag generation technique to address this challenge. Our key innovation lies in a cross-patient pseudo-bag generation framework that extracts complementary pathological features to construct distributionally consistent pseudo-bags. To resolve the critical challenge of distributional alignment in pseudo-bag generation, we propose an affinity-driven curriculum contrastive learning strategy, integrating sample affinity metrics with progressive training to stabilize representation learning. Unlike prior methods focused on bag-level embeddings, our framework pioneers a paradigm shift toward multi-instance feature distribution mining, explicitly modeling inter-bag heterogeneity to address class imbalance. 
Our method demonstrates significant performance improvements on three datasets with multiple classification difficulties, outperforming the second-best method by an average of 1.95 percentage points in F1 score and 2.07 percentage points in ACC.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"904-914"},"PeriodicalIF":13.7,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
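Cross-patient pseudo-bag generation, as described in this abstract, pools instances from multiple patients' bags of the same class and resamples them into fixed-size, class-balanced pseudo-bags. A minimal sketch of that sampling step (the function name and uniform sampling without replacement are assumptions; the paper's distribution-consistency and curriculum machinery are omitted):

```python
import random

def make_pseudo_bags(bags_by_class, bag_size, n_bags_per_class, seed=0):
    """Toy cross-patient pseudo-bag generation: for each class, pool
    instance features from all patients' bags, then sample fixed-size
    pseudo-bags so every class contributes the same number of bags."""
    rng = random.Random(seed)
    pseudo = []
    for label, bags in bags_by_class.items():
        pool = [inst for bag in bags for inst in bag]  # cross-patient pool
        for _ in range(n_bags_per_class):
            pseudo.append((rng.sample(pool, bag_size), label))
    return pseudo
```

Because every class emits `n_bags_per_class` bags regardless of how many patients or instances it has, a rare class is no longer under-represented at the bag level, which is the rebalancing effect the abstract targets.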