The multi-classification of histopathological images under imbalanced sample conditions remains a long-standing unresolved challenge in computational pathology. In this paper, we propose, for the first time, a cross-patient pseudo-bag generation technique to address this challenge: a framework that extracts complementary pathological features across patients to construct distributionally consistent pseudo-bags. To resolve the critical challenge of distributional alignment in pseudo-bag generation, we propose an affinity-driven curriculum contrastive learning strategy, integrating sample affinity metrics with progressive training to stabilize representation learning. Unlike prior methods focused on bag-level embeddings, our framework shifts the paradigm toward multi-instance feature distribution mining, explicitly modeling inter-bag heterogeneity to address class imbalance. Our method demonstrates significant performance improvements on three datasets of varying classification difficulty, outperforming the second-best method by an average of 1.95 percentage points in F1 score and 2.07 percentage points in accuracy.
Title: Imbalanced Multiclassification Challenges in Whole Slide Image: Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning With Dynamic Rebalancing. Authors: Yonghuang Wu; Xuan Xie; Chengqian Zhao; Pengfei Song; Feiyu Yin; Guoqing Wu; Jinhua Yu. IEEE Transactions on Image Processing, vol. 35, pp. 904-914. DOI: 10.1109/TIP.2026.3654402. Pub Date: 2026-01-21.
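As a rough illustration of the cross-patient pseudo-bag idea in the abstract above, instance features from several same-class patient bags can be pooled and redistributed into label-preserving pseudo-bags. This is a minimal sketch, not the authors' method: the fixed pseudo-bag size, uniform shuffling, and plain feature pooling are assumptions made for illustration.

```python
import numpy as np

def make_cross_patient_pseudo_bags(bags, label, pseudo_bag_size=256, seed=0):
    """Pool instance features from several same-class patient bags and
    redistribute them into roughly equal-size pseudo-bags (simplified sketch).

    bags  : list of arrays, each of shape (n_i, d) -- patch features of one WSI
    label : class label shared by all input bags
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate(bags, axis=0)           # merge instances across patients
    rng.shuffle(pooled)                             # mix complementary features
    n_pseudo = max(1, len(pooled) // pseudo_bag_size)
    pseudo_bags = np.array_split(pooled, n_pseudo)  # distributionally similar splits
    return [(pb, label) for pb in pseudo_bags]

# Usage: three patients of the same (hypothetical) tumour subtype, 1024-dim patch features.
bags = [np.random.randn(np.random.randint(300, 800), 1024) for _ in range(3)]
pseudo = make_cross_patient_pseudo_bags(bags, label=2)
print(len(pseudo), pseudo[0][0].shape)
```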
Pub Date: 2026-01-19. DOI: 10.1109/TIP.2025.3650052
Yuming Yang;Wei Wang
Multi-exposure image fusion (MEF) is the main approach to obtaining High Dynamic Range (HDR) images by fusing multiple images taken under different exposure values. In this paper, we propose a novel variational model based on detail-base decomposition for MEF. The main idea is to incorporate the decomposition procedure and the reconstruction procedure into a unified framework, allowing the detail information and the base information to interact. Specifically, we use Tikhonov regularization to model the base layer, and we present an efficient design for the detail layer that captures fine details more effectively. Meanwhile, we incorporate multi-scale techniques to remove halo artifacts. Numerically, we apply the alternating direction method of multipliers (ADMM) to solve the proposed minimization problem. Theoretically, we study the existence of a solution of the proposed model and the convergence of the proposed ADMM algorithm. Experimental results demonstrate that the proposed model outperforms competing methods in terms of visual quality and quantitative criteria; e.g., it gives the best natural image quality evaluator (NIQE) values, with 1%-10% improvement in the real image fusion experiments, and the best PSNR values, with 13%-20% improvement in the synthetic image fusion experiment.
Title: A Variational Multi-Scale Model for Multi-Exposure Image Fusion. IEEE Transactions on Image Processing, vol. 35, pp. 701-716.
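For intuition about detail-base decomposition, a single layer of it can be sketched with a Tikhonov-regularized base layer solved in closed form via the FFT, with the detail layer taken as the residual. This is a toy stand-in, not the paper's coupled variational model or its ADMM solver; the regularization weight and the periodic boundary conditions are assumptions.

```python
import numpy as np

def tikhonov_base_detail(img, lam=5.0):
    """Split an image into base + detail layers by solving
    min_B ||B - I||^2 + lam * ||grad B||^2 in closed form via the FFT
    (periodic boundaries assumed); detail = I - base."""
    h, w = img.shape
    # Eigenvalues of the negative discrete Laplacian under periodic boundaries.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    lap_eig = (2 * np.sin(np.pi * fy)) ** 2 + (2 * np.sin(np.pi * fx)) ** 2
    base = np.real(np.fft.ifft2(np.fft.fft2(img) / (1.0 + lam * lap_eig)))
    return base, img - base

# Usage: decompose one exposure of a synthetic stack.
img = np.random.rand(64, 64)
base, detail = tikhonov_base_detail(img)
print(np.allclose(img, base + detail))  # True by construction
```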
Pub Date: 2026-01-16. DOI: 10.1109/TIP.2026.3653189
Xiang Fang;Shihua Zhang;Hao Zhang;Xiaoguang Mei;Huabing Zhou;Jiayi Ma
Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing the different information underlying them. Previous methods either treat this information equally or require explicit storage of the entire context, which tends to be laborious in real-world scenarios. Inspired by Mamba’s inherent selectivity, we propose CorrMamba, a Correspondence filter that leverages Mamba’s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being impacted by unordered keypoints that would obscure its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code is publicly available at https://github.com/ShineFox/CorrMamba
Title: Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning. IEEE Transactions on Image Processing, vol. 35, pp. 816-829.
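The causal sequentialization step can be illustrated with a generic Gumbel-Softmax selection loop: learned per-keypoint scores are turned into a differentiable permutation by repeatedly sampling one keypoint with the straight-through Gumbel-Softmax estimator and masking it out, so a downstream sequence model sees keypoints in a learned causal order. This is a sketch of the general technique, not CorrMamba's exact module; the score source, temperature, and sequential masking scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def gumbel_causal_permutation(scores, tau=0.5):
    """Build an (n, n) permutation matrix from per-keypoint scores by sequential
    straight-through Gumbel-Softmax sampling: each row picks one not-yet-chosen
    keypoint, defining a differentiable causal ordering."""
    n = scores.shape[0]
    chosen = torch.zeros(n)
    rows = []
    for _ in range(n):
        logits = scores.masked_fill(chosen.bool(), float("-inf"))
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)  # straight-through
        rows.append(one_hot)
        chosen = chosen + one_hot.detach()                      # mask the picked keypoint
    return torch.stack(rows, dim=0)

# Usage: order 6 keypoint features (dim 4) by learned scores.
feats = torch.randn(6, 4)
scores = torch.randn(6, requires_grad=True)   # in practice produced by a small network
perm = gumbel_causal_permutation(scores)
ordered = perm @ feats                        # gradients flow back into `scores`
ordered.sum().backward()
print(scores.grad.shape)
```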
Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. This paradigm can be termed LLMs for Vision because it employs LLMs for visual understanding and reasoning; yet we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs by empowering multimodal knowledge storage and sharing within them. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on image-text understanding multimodal benchmarks. The code will be available at: https://github.com/HITsz-TMG/MKS2-Multimodal-Knowledge-Storage-and-Sharing
Title: Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs. Authors: Yunxin Li; Zhenyu Liu; Baotian Hu; Wei Wang; Yuxin Ding; Xiaochun Cao; Min Zhang. IEEE Transactions on Image Processing, vol. 35, pp. 858-871. DOI: 10.1109/TIP.2025.3649356. Pub Date: 2026-01-16.
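The soft mixture-of-experts routing described above can be sketched generically: a gating network produces soft weights and the block output is the weighted sum of all expert outputs, so different knowledge sources can be blended during text generation. The layer sizes, the two-expert setup, and the interpretation of one expert as a "visual memory" are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Soft mixture of experts: every token is processed by all experts and the
    outputs are blended with gating weights (no hard routing)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                                          # x: (B, S, D)
        weights = torch.softmax(self.gate(x), dim=-1)              # (B, S, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, S, D, E)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)          # (B, S, D)

# Usage: blend a "textual" expert with a hypothetical "visual memory" expert.
layer = SoftMoE()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)   # torch.Size([2, 16, 512])
```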
Pub Date: 2026-01-16. DOI: 10.1109/TIP.2026.3652360
Tao Hu;Longyao Wu;Wei Dong;Peng Wu;Jinqiu Sun;Xiaogang Xu;Qingsen Yan;Yanning Zhang
Recovering High Dynamic Range (HDR) images from multiple Standard Dynamic Range (SDR) images becomes challenging when the SDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB SDR images, and the domain/format gap poses a significant challenge when applying them to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from the SDR domain via self-distillation to boost existing HDR reconstruction methods. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs of the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a Semantic Knowledge Alignment Module (SKAM) to fill in the missing semantic content with complementary masks. Extensive experiments demonstrate that our framework significantly boosts the HDR imaging quality of existing methods without altering their network architectures.
Title: Boosting HDR Image Reconstruction via Semantic Knowledge Transfer. IEEE Transactions on Image Processing, vol. 35, pp. 1910-1922.
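The output-level self-distillation constraint can be sketched as a simple alignment loss between the baseline reconstruction and the semantic-prior-guided reconstruction treated as the teacher. This is a hedged sketch only: the L1/cosine choices, the detach-as-teacher convention, and the loss weights are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(hdr_baseline, hdr_guided, feat_baseline, feat_guided,
                           w_out=1.0, w_feat=0.1):
    """Align the baseline's output and features with those of the semantic-prior-
    guided model (used as a frozen teacher, hence detached)."""
    out_term = F.l1_loss(hdr_baseline, hdr_guided.detach())
    feat_term = 1.0 - F.cosine_similarity(
        feat_baseline.flatten(1), feat_guided.detach().flatten(1), dim=1).mean()
    return w_out * out_term + w_feat * feat_term

# Usage with dummy (B, 3, H, W) outputs and (B, C, H', W') internal features.
loss = self_distillation_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
                              torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16))
print(loss.item())
```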
Pub Date: 2026-01-16. DOI: 10.1109/TIP.2026.3653198
Kang Geon Lee;Soochahn Lee;Kyoung Mu Lee
We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.
Title: Particle Diffusion Matching: Random Walk Correspondence Search for the Alignment of Standard and Ultra-Widefield Fundus Images. IEEE Transactions on Image Processing, vol. 35, pp. 943-954.
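One iteration of such a refinement loop can be sketched as follows: given current particle positions and tentative matches, a global affine transform is fit by least squares and each particle is nudged toward a blend of the global prediction and its local target. This is a generic sketch, not PDM itself; in the paper the displacement comes from a diffusion model conditioned on appearance and particle structure, and the affine fit, blending weight, and step size here are assumptions.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2-D affine transform mapping src points to dst points."""
    A = np.hstack([src, np.ones((len(src), 1))])     # (n, 3)
    T, *_ = np.linalg.lstsq(A, dst, rcond=None)      # (3, 2)
    return T

def refine_particles(src_pts, particles, local_targets, n_iters=10, alpha=0.5, step=0.5):
    """Iteratively move particles toward a blend of the globally consistent affine
    prediction and their local (appearance-based) targets."""
    for _ in range(n_iters):
        T = fit_affine(src_pts, particles)
        global_pred = np.hstack([src_pts, np.ones((len(src_pts), 1))]) @ T
        target = alpha * global_pred + (1 - alpha) * local_targets
        particles = particles + step * (target - particles)
    return particles

# Usage: particles start at the source keypoints and walk toward noisy matches
# generated from a ground-truth affine map.
rng = np.random.default_rng(0)
src = rng.uniform(0, 100, (40, 2))
true_T = np.array([[1.1, 0.05], [-0.03, 0.95], [5.0, -3.0]])
gt = np.hstack([src, np.ones((40, 1))]) @ true_T
noisy = gt + rng.normal(0, 2.0, gt.shape)
refined = refine_particles(src, src.copy(), noisy)
print(np.abs(refined - gt).mean())   # much smaller than np.abs(src - gt).mean()
```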
Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervised loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image’s power. The performance of our foundation model is validated on nine typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.
Title: A Complex-Valued SAR Foundation Model Based on Physically Inspired Representation Learning. Authors: Mengyu Wang; Hanbo Bi; Yingchao Feng; Linlin Xin; Shuo Gong; Tianqi Wang; Zhiyuan Yan; Peijin Wang; Wenhui Diao; Xian Sun. IEEE Transactions on Image Processing, vol. 35, pp. 2094-2109. DOI: 10.1109/TIP.2026.3652417. Pub Date: 2026-01-16.
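The power self-supervision idea, modeling pixel scattering intensity as a weighted combination of scattering bases, can be sketched generically as a reconstruction loss. The number of bases, their power values, and the mean-squared penalty are assumptions; the paper additionally aligns the predicted coefficients with Yamaguchi decomposition results, which is not shown here.

```python
import torch
import torch.nn.functional as F

def power_self_supervised_loss(coeffs, basis_powers, observed_power):
    """Reconstruct per-pixel power as a weighted combination of scattering-basis
    powers and compare it with the power of the input SAR image.

    coeffs         : (B, K, H, W) scattering coefficients from the decoder
    basis_powers   : (K,)         power contributed by each scattering basis
    observed_power : (B, H, W)    power computed from the complex-valued input
    """
    recon = torch.einsum("bkhw,k->bhw", coeffs, basis_powers)
    return F.mse_loss(recon, observed_power)

# Usage with dummy tensors (K = 4 hypothetical scattering mechanisms).
coeffs = torch.rand(2, 4, 32, 32)
basis_powers = torch.tensor([1.0, 0.8, 0.5, 0.3])
observed = torch.rand(2, 32, 32)
print(power_self_supervised_loss(coeffs, basis_powers, observed).item())
```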
Pub Date: 2026-01-15. DOI: 10.1109/TIP.2025.3650668
Ruonan Zhang;Gaoyun An;Yiqing Hao;Dapeng Oliver Wu
Scene Graph Generation (SGG) is a challenging cross-modal task that aims to identify entities and relationships in a scene simultaneously. Due to the highly skewed long-tailed distribution, the generated scene graphs are dominated by the relation categories of head samples. Current works address this problem by designing re-balancing strategies at the data level or refining relation representations at the feature level. Different from them, we attribute this effect to catastrophic interference, that is, the subsequent learning of dominant relations tends to overwrite the earlier learning of rare relations. To address it at the modeling level, a Hippocampal Memory-Like Separation-Completion Collaborative Network (HMSC2) is proposed here, which imitates the hippocampal encoding and retrieval process. Inspired by the pattern separation of the dentate gyrus during memory encoding, a Gradient Separation Classifier and a Prototype Separation Learning module are proposed to relieve the catastrophic interference with tail categories by modeling separated classifiers and prototypes. In addition, inspired by the pattern completion of area CA3 of the hippocampus during memory retrieval, a Prototype Completion Module is designed to supplement the incomplete information of prototypes by introducing relation representations as cues. Finally, the completed prototype and relation representations are connected within a hypersphere space by a Contrastive Connected Module. Experimental results on the Visual Genome and GQA datasets show that our HMSC2 achieves state-of-the-art performance on the unbiased SGG task, effectively alleviating the long-tailed problem. The source code is released on GitHub: https://github.com/Nora-Zhang98/HMSC2
Title: Hippocampal Memory-Like Separation-Completion Collaborative Network for Unbiased Scene Graph Generation. IEEE Transactions on Image Processing, vol. 35, pp. 770-785.
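The final contrastive connection between completed prototypes and relation representations on a hypersphere can be sketched with a standard prototype-contrastive loss: both features and class prototypes are L2-normalized, and each relation feature is pulled toward its class prototype and pushed away from the others. This is a generic sketch with an assumed temperature, not HMSC2's exact module.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feats, prototypes, labels, temperature=0.1):
    """Contrast L2-normalized relation features against class prototypes on the
    unit hypersphere via cross-entropy over cosine similarities."""
    feats = F.normalize(feats, dim=-1)              # (N, D)
    prototypes = F.normalize(prototypes, dim=-1)    # (C, D)
    logits = feats @ prototypes.t() / temperature   # (N, C)
    return F.cross_entropy(logits, labels)

# Usage: 8 relation features, 5 relation classes, 64-dim embeddings.
feats = torch.randn(8, 64)
protos = torch.randn(5, 64)
labels = torch.randint(0, 5, (8,))
print(prototype_contrastive_loss(feats, protos, labels).item())
```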
Recent advances in “track-anything” models have significantly improved fine-grained video understanding by simultaneously handling multiple video segmentation and tracking tasks. However, existing models often struggle with robust and efficient temporal propagation. To address these challenges, we propose the Sparse Spatio-Temporal Propagation (SSTP) method, which achieves robust and efficient unified video segmentation by selectively leveraging key spatio-temporal features in videos. Specifically, we design a dynamic 3D spatio-temporal convolution to aggregate global multi-frame spatio-temporal information into memory frames during memory construction. Additionally, we introduce a spatio-temporal aggregation reading strategy to efficiently aggregate the relevant spatio-temporal features from multiple memory frames during memory retrieval. By combining SSTP with an image segmentation foundation model, such as the segment anything model, our method effectively addresses multiple data-scarce video segmentation tasks. Our experimental results demonstrate state-of-the-art performance on five video segmentation tasks across eleven datasets, outperforming both task-specific and unified methods. Notably, SSTP exhibits strong robustness in handling sparse, low-frame-rate videos, making it well-suited for real-world applications.
Title: Fast Track Anything With Sparse Spatio-Temporal Propagation for Unified Video Segmentation. Authors: Jisheng Dang; Huicheng Zheng; Zhixuan Chen; Zhang Li; Yulan Guo; Tat-Seng Chua. IEEE Transactions on Image Processing, vol. 35, pp. 955-969. DOI: 10.1109/TIP.2025.3649365. Pub Date: 2026-01-15.
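The multi-frame aggregation step can be illustrated with a plain 3-D convolution over stacked frame features that collapses the temporal axis into a single memory frame. This is a simplified, non-dynamic stand-in for the paper's dynamic 3D spatio-temporal convolution; the channel count, temporal window, and kernel shape are assumptions.

```python
import torch
import torch.nn as nn

class TemporalMemoryAggregator(nn.Module):
    """Aggregate T frames of (C, H, W) features into a single memory frame with a
    3-D convolution over the temporal and spatial axes."""
    def __init__(self, channels=256, t_window=5):
        super().__init__()
        self.agg = nn.Conv3d(channels, channels,
                             kernel_size=(t_window, 3, 3),
                             padding=(0, 1, 1))        # collapse time, keep H x W

    def forward(self, frame_feats):                    # (B, C, T, H, W)
        return self.agg(frame_feats).squeeze(2)        # (B, C, H, W)

# Usage: fuse a 5-frame memory window of 256-channel features.
agg = TemporalMemoryAggregator()
memory = agg(torch.randn(1, 256, 5, 30, 40))
print(memory.shape)   # torch.Size([1, 256, 30, 40])
```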
Pub Date: 2026-01-15. DOI: 10.1109/TIP.2025.3648872
Wenjie Li;Heng Guo;Yuefeng Hou;Zhanyu Ma
Image super-resolution (SR) aims to recover high-resolution images from low-resolution inputs, and improving SR efficiency is a high-profile challenge. However, commonly used units in SR, such as convolutions and window-based Transformers, have limited receptive fields, making it challenging to apply them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling the convolution theorem through token mixing, we propose a Fourier token-based plugin called FourierSR that improves SR uniformly and avoids the instability or inefficiency of existing token-mixing techniques when applied as plug-ins. Furthermore, compared to convolutions and window-based Transformers, our FourierSR only utilizes Fourier transforms and multiplication operations, greatly reducing complexity while having global receptive fields. Experimental results show that FourierSR, as a plug-and-play unit, brings an average PSNR gain of 0.34 dB for existing efficient SR methods on the Manga109 test set at a scale of ×4, while the average increase in the number of parameters and FLOPs is only 0.6% and 1.5% of the original sizes. We will release our code upon acceptance.
Title: FourierSR: A Fourier Token-Based Plugin for Efficient Image Super-Resolution. IEEE Transactions on Image Processing, vol. 35, pp. 732-742.
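The Fourier token-mix idea, replacing spatial convolution with element-wise multiplication in the frequency domain via the convolution theorem, can be sketched as a GFNet-style global filter. This is a hedged sketch: the fixed training resolution and the per-channel complex weights are assumptions, not necessarily FourierSR's exact design.

```python
import torch
import torch.nn as nn

class FourierTokenMix(nn.Module):
    """Global token mixing via the convolution theorem: FFT -> learnable
    element-wise complex filter -> inverse FFT.  The receptive field spans the
    whole feature map at the cost of two FFTs and a pointwise multiply."""
    def __init__(self, channels, height, width):
        super().__init__()
        # One complex filter per channel, stored as (real, imag) pairs.
        self.weight = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x):                                   # (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")             # (B, C, H, W//2+1) complex
        spec = spec * torch.view_as_complex(self.weight)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

# Usage: mix 64-channel features of a 48x48 low-resolution input.
mix = FourierTokenMix(64, 48, 48)
print(mix(torch.randn(2, 64, 48, 48)).shape)   # torch.Size([2, 64, 48, 48])
```

By the convolution theorem, this pointwise product in the frequency domain corresponds to a circular convolution with a kernel as large as the feature map, which is where the global receptive field at near-linear cost comes from.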