MTRAG: Multi-Target Referring and Grounding via Hybrid Semantic-Spatial Integration
Pub Date: 2026-02-16 | DOI: 10.1109/TIP.2026.3663039
Yili Ren;Jinyang Du;Xi Liu;Qianxiao Su;Yue Deng;Hongjue Li
Fine-grained visual referring and grounding are critical for enhancing scene understanding and enabling various real-world vision-language applications. Although recent studies have extended multimodal large language models (MLLMs) to these tasks, they still face significant challenges in fine-grained multi-target scenarios. To address this, we propose MTRAG, a pixel-level multi-target referring and grounding framework that leverages semantic-spatial collaboration. Specifically, we introduce a Channel Extension Mechanism (CEM) that enables a global image encoder to extract global semantics and multi-region representations while retaining background context, without extra region feature extractors. Moreover, we introduce a grounding branch for pixel-level grounding and design a Hybrid Adapter (HA) to fuse semantic features from the MLLM branch with spatial information from the grounding branch, thereby enhancing the semantic-spatial alignment. For training, we meticulously curate MTRAG-D, a dataset comprising single- and multi-target referring and grounding samples derived from existing datasets and newly synthesized free-form multi-target referring instruction-following data. We also present MTR-Bench, a benchmark for systematic evaluation of multi-target referring. Extensive experiments across five core tasks, including single- and multi-target referring and grounding as well as image-level captioning, show that MTRAG consistently outperforms strong baselines on both multi- and single-target tasks, while maintaining competitive image-level understanding. The code is available at https://github.com/deng-ai-lab/MTRAG.
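The abstract does not detail the Hybrid Adapter's internals. As a minimal sketch of semantic-spatial fusion, the PyTorch snippet below lets grounding-branch spatial features attend to MLLM-branch semantic tokens through a single cross-attention layer; the class name, dimensions, and residual design are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a hybrid semantic-spatial adapter: spatial features
# (queries) attend to semantic tokens (keys/values) from a language branch.
import torch
import torch.nn as nn

class HybridAdapter(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # spatial: (B, H*W, C) grounding-branch features; semantic: (B, N, C) tokens
        fused, _ = self.attn(self.norm(spatial), semantic, semantic)
        return spatial + fused  # residual keeps spatial detail intact

feats = HybridAdapter(256)(torch.randn(2, 64 * 64, 256), torch.randn(2, 16, 256))
print(feats.shape)  # torch.Size([2, 4096, 256])
```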
{"title":"MTRAG: Multi-Target Referring and Grounding via Hybrid Semantic-Spatial Integration","authors":"Yili Ren;Jinyang Du;Xi Liu;Qianxiao Su;Yue Deng;Hongjue Li","doi":"10.1109/TIP.2026.3663039","DOIUrl":"10.1109/TIP.2026.3663039","url":null,"abstract":"Fine-grained visual referring and grounding are critical for enhancing scene understanding and enabling various real-world vision-language applications. Although recent studies have extended multimodal large language models (MLLMs) to these tasks, they still face significant challenges in fine-grained multi-target scenarios. To address this, we propose MTRAG, a pixel-level multi-target referring and grounding framework that leverages semantic-spatial collaboration. Specifically, we introduce a Channel Extension Mechanism (CEM) that enables a global image encoder to extract global semantics and multi-region representations while retaining background context, without extra region feature extractors. Moreover, we introduce a grounding branch for pixel-level grounding and design a Hybrid Adapter (HA) to fuse semantic features from the MLLM branch with spatial information from the grounding branch, thereby enhancing the semantic-spatial alignment. For training, we meticulously curate MTRAG-D, a dataset comprising single- and multi-target referring and grounding samples derived from existing datasets and newly synthesized free-form multi-target referring instruction-following data. We also present MTR-Bench, a benchmark for systematic evaluation of multi-target referring. Extensive experiments across five core tasks, including single- and multi-target referring and grounding as well as image-level captioning, show that MTRAG consistently outperforms strong baselines on both multi- and single-target tasks, while maintaining competitive image-level understanding. The code is available at <uri>https://github.com/deng-ai-lab/MTRAG</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2167-2181"},"PeriodicalIF":13.7,"publicationDate":"2026-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146204870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion
Pub Date: 2026-02-16 | DOI: 10.1109/TIP.2026.3662596
Yixin Zhu;Long Lv;Pingping Zhang;Xuehu Liu;Tongdan Tang;Feng Tian;Weibing Sun;Huchuan Lu
Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images that retain texture details and preserve significant information. Recently, some MMIF methods incorporate frequency-domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that ISFM achieves better performance than other state-of-the-art methods. The source code is available at https://github.com/Namn23/ISFM.
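A minimal sketch of the band split such spatial-frequency interaction builds on, assuming an FFT-based decomposition with a radial low-pass mask; the ISF module's actual design is not specified in the abstract.

```python
# Illustrative FFT-based low/high-frequency split of a feature map.
import torch

def split_frequency(x: torch.Tensor, radius: float = 0.25):
    """Split (B, C, H, W) features into low- and high-frequency components."""
    spec = torch.fft.fft2(x, norm="ortho")
    spec = torch.fft.fftshift(spec, dim=(-2, -1))
    B, C, H, W = x.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, H), torch.linspace(-0.5, 0.5, W), indexing="ij"
    )
    mask = ((xx**2 + yy**2).sqrt() <= radius).to(x.dtype)  # radial low-pass
    low = torch.fft.ifft2(
        torch.fft.ifftshift(spec * mask, dim=(-2, -1)), norm="ortho"
    ).real
    return low, x - low  # the high band is the residual

low, high = split_frequency(torch.randn(1, 8, 32, 32))
print(low.shape, high.shape)
```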
{"title":"Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion","authors":"Yixin Zhu;Long Lv;Pingping Zhang;Xuehu Liu;Tongdan Tang;Feng Tian;Weibing Sun;Huchuan Lu","doi":"10.1109/TIP.2026.3662596","DOIUrl":"10.1109/TIP.2026.3662596","url":null,"abstract":"Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods incorporate frequency domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM can achieve better performances than other state-of-the-art methods. The source code is available at <uri>https://github.com/Namn23/ISFM</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2380-2392"},"PeriodicalIF":13.7,"publicationDate":"2026-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146204871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeepGSR: Deep Group-Based Sparse Representation Network for Solving Image Inverse Problems
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662583
Ke Jiang;Xinya Ji;Baoshun Shi
In the past few years, group-based sparse representation (GSR) has emerged as a powerful paradigm for image inverse problems by synergizing model-driven interpretability with nonlocal self-similarity priors. Nevertheless, its practical utility is hindered by computationally expensive iterative processes. Deep learning (DL) methods avoid this deficiency, but they often lack model interpretability. To bridge this gap, we propose a novel deep group-based sparse representation framework, termed DeepGSR, which brings the GSR method and the DL approach together. DeepGSR not only circumvents the iterative bottlenecks of conventional GSR but also preserves its model interpretability through a learnable parameterization. Specifically, the network is built upon a GSR model that leverages nonlocal self-similarity, and it integrates adaptive patch matching and aggregation mechanisms to model complex intra-group relationships in the latent space. To reduce the computational complexity associated with traditional SVD-based rank shrinkage, we introduce a learnable low-rank shrinkage module that incorporates low-rank constraints while enhancing the interpretability and adaptability of the model. To better exploit frequency-specific structures, the network incorporates a shifting wavelet-domain patch partitioning strategy, which separately models high- and low-frequency components to further enhance the representation ability of the network. Extensive experiments demonstrate that DeepGSR, when applied as a drop-in replacement module to various image inverse problems such as image denoising, single-image deraining, metal artifact reduction, sparse-view computed tomography reconstruction, phase retrieval, and all-in-one image restoration, consistently delivers effective performance, validating the effectiveness of the proposed framework. The source code and datasets have been made publicly available at https://github.com/shibaoshun/DeepGSR.
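For context, the snippet below shows the classical SVD-based low-rank shrinkage of a patch group that the learnable module is designed to replace; exposing the threshold as a learnable parameter, as here, is one simple way to make the step trainable (a sketch, not the paper's module).

```python
# Singular-value soft-thresholding on a group of similar patches.
import torch

def lowrank_shrink(group: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """group: (N, D) matrix of N vectorized similar patches; tau: threshold."""
    U, S, Vh = torch.linalg.svd(group, full_matrices=False)
    S_shrunk = torch.clamp(S - tau, min=0.0)  # soft-threshold singular values
    return U @ torch.diag(S_shrunk) @ Vh      # low-rank reconstruction

tau = torch.nn.Parameter(torch.tensor(0.1))  # learned jointly with the network
out = lowrank_shrink(torch.randn(16, 64), tau)
print(out.shape)  # torch.Size([16, 64])
```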
{"title":"DeepGSR: Deep Group-Based Sparse Representation Network for Solving Image Inverse Problems","authors":"Ke Jiang;Xinya Ji;Baoshun Shi","doi":"10.1109/TIP.2026.3662583","DOIUrl":"10.1109/TIP.2026.3662583","url":null,"abstract":"In the past few years, group-based sparse representation (GSR) has emerged as a powerful paradigm for image inverse problems by synergizing model-driven interpretability with nonlocal self-similarity priors. Nevertheless, its practical utility is hindered by computationally expensive iterative processes. Deep learning (DL) methods can avoid this deficiency, but they often lack of model interpretability. To bridge this gap, we propose a novel deep group-based sparse representation framework, termed DeepGSR, which brings the GSR method and the DL approach together. DeepGSR not only circumvents the iterative bottlenecks of conventional GSR but also preserves its model interpretability through a learnable parameterization. Specifically, the network is built upon a GSR model that leverages nonlocal self-similarity, and it integrates adaptive patch matching and aggregation mechanisms to model complex intra-group relationships in the latent space. To reduce the computational complexity associated with traditional SVD-based rank shrinkage, we introduce a learnable low-rank shrinkage module that incorporates low-rank constraints while enhancing the interpretability and adaptability of the model. To better exploit frequency-specific structures, the network incorporates a shifting wavelet-domain patch partitioning strategy, which separately models high- and low-frequency components to further enhance the representation ability of the network. Extensive experiments demonstrate that DeepGSR, when applied as a drop-in replacement module to various image inverse problems such as image denoising, single-image deraining, metal artifact reduction, sparse-view computed tomography reconstruction, phase retrieval, and all-in-one image restoration consistently delivers effective performance, validating the effectiveness of the proposed framework. The source code and datasets have been made publicly available at <uri>https://github.com/shibaoshun/DeepGSR</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2454-2469"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASDTracker: Adaptively Sparse Detection With Attention-Guided Refinement for Efficient Multi-Object Tracking
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662594
Yueying Wang;Chenyang Yan;Cairong Zhao;Weidong Zhang;Dan Zeng
Tracking-by-Detection paradigms shine in generic multi-object tracking (MOT), while their computationally heavy components hinder real-time applications. In this work, we attribute the substantial computational burden to two expensive components, i.e., detection and re-identification. Building upon the principle of adaptively maintaining acceptable inference efficiency, we present Adaptively Sparse Detection with attention-guided refinement (ASDTracker) for efficient tracking. Specifically, our ASDTracker rapidly assesses short-term and long-term occlusion, dynamically determining the usage of the expensive detector. For non-key frames, we efficiently refine small-size crops around Kalman Filter predictions and introduce noisy shadow labels to robustly train this refinement network. Additionally, we substitute a lightweight appearance representation for the heavy ReID network, which efficiently extracts sufficient appearance cues in coarsely quantized color spaces. Extensive experiments on four benchmarks demonstrate that ASDTracker achieves competitive performance in generalization and robustness at a favorable inference speed. Moreover, the tracker is further deployed on an unmanned surface vehicle with high accuracy and low latency in real-world scenarios.
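A rough sketch of a lightweight appearance cue from a coarsely quantized color space, in the spirit of the ReID substitute described above; the bin count and the histogram-intersection similarity are assumptions.

```python
# Joint RGB histogram over coarsely quantized colors as a cheap appearance cue.
import torch

def color_histogram(crop: torch.Tensor, bins: int = 8) -> torch.Tensor:
    """crop: (3, H, W) in [0, 1] -> L1-normalized joint color histogram."""
    q = (crop.clamp(0, 1) * (bins - 1)).long()       # quantize each channel
    idx = (q[0] * bins + q[1]) * bins + q[2]         # joint bin index per pixel
    hist = torch.bincount(idx.flatten(), minlength=bins**3).float()
    return hist / hist.sum()

a, b = torch.rand(3, 64, 32), torch.rand(3, 64, 32)
sim = torch.minimum(color_histogram(a), color_histogram(b)).sum()  # intersection
print(float(sim))
```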
{"title":"ASDTracker: Adaptively Sparse Detection With Attention-Guided Refinement for Efficient Multi-Object Tracking","authors":"Yueying Wang;Chenyang Yan;Cairong Zhao;Weidong Zhang;Dan Zeng","doi":"10.1109/TIP.2026.3662594","DOIUrl":"10.1109/TIP.2026.3662594","url":null,"abstract":"Tracking-by-Detection paradigms shine in generic multi-object tracking (MOT), while their compact construction hinders the real-time applications. In this work, we attribute the substantial computational burden to two expensive components, i.e. detection and re-identification. Building upon the principle of adaptively maintaining acceptable inference efficiency, we present Adaptively Sparse Detection with attention-guided refinement (ASDTracker) for efficient tracking. In specific, our ASDTracker rapidly assess the short-term and long-term occlusion, dynamically determining the usage of the expensive detector. For non-key frames, we efficiently refine small-size crops out of Kalman Filter predictions and introduce the noisy shadow labels to robustly train this refinement network. Additionally, we substitute the lightweight appearance representation for the heavy ReID network, which efficiently extracts sufficient appearance cues in the coarsely quantized color spaces. Extensive experiments on four benchmarks demonstrate that ASDTracker achieves competitive performance in generalization and robustness under favorable inference speed. Moreover, the efficient tracking deployment is further implemented to an unmanned surface vehicle with high accuracy and low latency in real-world scenarios.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1993-2006"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust 2.5D Feature Matching in Light Fields via a Learnable Parameterized Depth-Degraded Projection
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662579
Meng Zhang;Haiyan Jin;Zhaolin Xiao;Jinglei Shi;Xiaoran Jiang
Due to the loss of 3D information, accurate and robust 2D image feature matching remains challenging for many computer vision applications. This paper introduces a 2.5D feature that uses the disparity value from the light field Fourier disparity layer (FDL) as a rough proxy of scene depth. Without explicit depth estimation, a parameterized depth-degraded projection is proposed to construct the geometric transformation of paired features between two light fields. Then, we propose a parameterized learning solution to calculate the depth-degraded projection. This solution estimates a global constant fundamental matrix, a variable disparity-guided translation vector, and a depth compensation term using a very simple network. Although the 0.5D relative disparity provided by the FDL does not represent precise depth, it still significantly reduces depth ambiguity in feature matching. Therefore, the proposed solution achieves accurate feature-matching results by minimizing the sum of reprojection errors across all matching candidates. On the public light field feature-matching dataset, the proposed solution outperforms existing 2D image feature-matching solutions and light field feature-matching algorithms in terms of matching accuracy and robustness. The code is available online.
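The snippet below sketches only the matching objective: the sum of point-to-epipolar-line distances under a fundamental matrix F. The disparity-guided translation and depth compensation terms of the full projection are omitted.

```python
# Sum of point-to-epipolar-line distances for matched points under F.
import torch

def epipolar_residual(F: torch.Tensor, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """F: (3, 3) fundamental matrix; x1, x2: (N, 2) matched pixel coordinates."""
    ones = torch.ones(x1.shape[0], 1)
    p1 = torch.cat([x1, ones], dim=1)        # homogeneous coordinates
    p2 = torch.cat([x2, ones], dim=1)
    l2 = p1 @ F.T                            # epipolar lines in the second image
    num = (p2 * l2).sum(dim=1).abs()         # |x2^T F x1|
    den = l2[:, :2].norm(dim=1).clamp_min(1e-8)
    return (num / den).sum()                 # sum of reprojection errors

print(epipolar_residual(torch.eye(3), torch.rand(10, 2), torch.rand(10, 2)))
```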
{"title":"Robust 2.5D Feature Matching in Light Fields via a Learnable Parameterized Depth-Degraded Projection","authors":"Meng Zhang;Haiyan Jin;Zhaolin Xiao;Jinglei Shi;Xiaoran Jiang","doi":"10.1109/TIP.2026.3662579","DOIUrl":"10.1109/TIP.2026.3662579","url":null,"abstract":"Due to the loss of 3D information, accurate and robust 2D image feature matching remains challenging for many computer vision applications. This paper introduces a 2.5D feature that uses the disparity value from the light field Fourier disparity layer (FDL) as a rough proxy of scene depth. Without explicit depth estimation, a parameterized depth-degraded projection is proposed to construct the geometric transformation of paired features between two light fields. Then, we propose a parameterized learning solution to calculate the depth-degraded projection. This solution estimates a global constant fundamental matrix, a variable disparity-guided translation vector, and a depth compensation term using a very simple network. Although the 0.5D relative disparity provided by the FDL does not represent precise depth, it can also significantly reduce the depth ambiguity in feature matching. Therefore, the proposed solution achieves accurate feature-matching results by minimizing the sum of reprojection errors across all matching candidates. On the public light field feature-matching dataset, the proposed solution outperforms existing 2D image feature-matching solutions and light field feature-matching algorithms in terms of matching accuracy and robustness. The code is available online.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2235-2248"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C-WOE: Clustering for Out-of-Distribution Detection Learning With Wild Outlier Exposure
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662593
Long Lan;Zhaohui Hu;He Li;Tongliang Liu;Xinwang Liu
Out-of-distribution (OOD) detection plays a crucial role as a mechanism for handling anomalies in computer vision systems. Among existing approaches, outlier exposure (OE), which trains the model with an additional auxiliary OOD dataset, has demonstrated strong effectiveness. However, acquiring clean and well-curated auxiliary OOD data is often infeasible, particularly within large and complex systems. Alternatively, wild outliers, i.e., unlabeled samples collected directly in deployment environments, are abundant and easy to obtain, and recent studies have shown that they can substantially benefit OOD detection learning. Nevertheless, wild outliers typically contain a mixture of in-distribution (ID) and OOD samples. Directly using them as auxiliary OOD data unavoidably exposes the model to adverse supervision signals arising from the contained ID samples. Yet existing methods still lack an effective strategy that can fully leverage wild outliers while suppressing the negative influence introduced by their ID subset. To this end, we propose a simple yet effective method named Clustering for Wild Outlier Exposure (C-WOE), which alleviates the adverse effect of the ID samples contained within wild outliers by reweighting them. Specifically, C-WOE assigns higher weights to real OOD samples and lower weights to ID samples and dynamically updates these weights during training. Theoretically, we establish solid guarantees for the proposed method. Empirically, extensive experiments conducted on various real-world benchmarks and simulated datasets demonstrate that C-WOE notably achieves superior performance compared with state-of-the-art methods, validating its reliability in image processing applications.
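As an illustration of the reweighting idea, assuming feature-space clustering: wild samples far from every in-distribution centroid receive larger OOD weights. The distance-to-weight mapping below is a hypothetical stand-in for the paper's rule.

```python
# Weight wild outliers by their distance to in-distribution feature centroids.
import torch

def wild_outlier_weights(wild_feats: torch.Tensor, id_centroids: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """wild_feats: (N, D); id_centroids: (K, D) -> per-sample weights in (0, 1)."""
    d_min = torch.cdist(wild_feats, id_centroids).min(dim=1).values
    # Far-from-ID samples get weights near 1, likely-ID samples near 0.
    return torch.sigmoid((d_min - d_min.mean()) / temperature)

w = wild_outlier_weights(torch.randn(6, 16), torch.randn(3, 16))
print(w)  # higher weight -> treated more confidently as real OOD
```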
{"title":"C-WOE: Clustering for Out-of-Distribution Detection Learning With Wild Outlier Exposure","authors":"Long Lan;Zhaohui Hu;He Li;Tongliang Liu;Xinwang Liu","doi":"10.1109/TIP.2026.3662593","DOIUrl":"10.1109/TIP.2026.3662593","url":null,"abstract":"Out-of-distribution (OOD) detection plays a crucial role as a mechanism for handling anomalies in computer vision systems. Among existing approaches, outlier exposure (OE), which trains the model with an additional auxiliary OOD dataset, has demonstrated strong effectiveness. However, acquiring clean and well-curated auxiliary OOD data is often infeasible, particularly within large and complex systems. Alternatively, wild outliers, i.e., unlabeled samples collected directly in deployment environments, are abundant and easy to obtain, and recent studies have shown that they can substantially benefit OOD detection learning. Nevertheless, wild outliers typically contain a mixture of in-distribution (ID) and OOD samples. Directly using them as auxiliary OOD data unavoidably exposes the model to adverse supervision signals arising from the contained ID samples. Yet existing methods still lack an effective strategy that can fully leverage wild outliers while suppressing the negative influence introduced by their ID subset. To this end, we propose a simple yet effective method named Clustering for Wild Outlier Exposure (C-WOE), which alleviates the adverse effect of the ID samples contained within wild outliers by reweighting them. Specifically, C-WOE assigns higher weights to real OOD samples and lower weights to ID samples and dynamically updates these weights during training. Theoretically, we establish solid guarantees for the proposed method. Empirically, extensive experiments conducted on various real-world benchmarks and simulated datasets demonstrate that C-WOE notably achieves superior performance compared with state-of-the-art methods, validating its reliability in image processing applications.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2066-2079"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IMPRESS: Incomplete Human Motion Prediction via Motion Recovery and Structural-Semantic Fusion
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662597
Hao Deng;Jinkai Li;Jinxing Li;Jie Wen;Yong Xu
Human motion prediction is a key task in computer vision and human-robot interaction, which has received much attention in recent years. However, existing approaches suffer from two issues: 1) they typically rely only on complete data and overlook real-world challenges such as missing observations; 2) recent works fail to capture the diverse relations among body parts in different action categories, which limits their prediction performance. To address the above problems, we propose a novel Incomplete human Motion Prediction method through motion Recovery and Structural-Semantic fusion (IMPRESS). Specifically, for motion recovery, we introduce a wavelet-based self-attention module. It captures motion details from high-frequency features and extracts global trends from low-frequency components. To enhance the relations among different body parts, we design a structural-semantic fusion graph convolutional network. Moreover, we employ a dual-channel sliding window attention mechanism to capture motion periodicity, enabling smoother predictions. Extensive experiments on two benchmark datasets (Human3.6M, CMU-MoCap) demonstrate that IMPRESS achieves state-of-the-art average prediction performance under both complete and incomplete observations.
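A minimal sketch of the low/high-frequency split that such a wavelet-based module operates on, using a single-level Haar transform along the time axis (even sequence length assumed; the attention itself is omitted).

```python
# Single-level Haar wavelet split of a motion sequence along time.
import torch

def haar_split(motion: torch.Tensor):
    """motion: (B, T, J) with even T -> (low, high), each (B, T//2, J)."""
    even, odd = motion[:, 0::2], motion[:, 1::2]
    low = (even + odd) / 2 ** 0.5    # approximation band: global trend
    high = (even - odd) / 2 ** 0.5   # detail band: fine motion detail
    return low, high

low, high = haar_split(torch.randn(2, 50, 66))
print(low.shape, high.shape)  # torch.Size([2, 25, 66]) twice
```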
{"title":"IMPRESS: Incomplete Human Motion Prediction via Motion Recovery and Structural-Semantic Fusion","authors":"Hao Deng;Jinkai Li;Jinxing Li;Jie Wen;Yong Xu","doi":"10.1109/TIP.2026.3662597","DOIUrl":"10.1109/TIP.2026.3662597","url":null,"abstract":"Human motion prediction is a key task in computer vision and human-robot interaction, which has received much attention in recent years. However, existing approaches suffer from two issues: 1) They typically rely only on complete data and overlook real-world challenges such as missing observations. 2) Recent works fail to capture the diverse relations among body parts in different action categories, which limits their prediction performance. To address the above problems, we propose a novel Incomplete human Motion Prediction method through motion R<sc>e</small> covery and Structure-Semantic fusion (IMPRESS). Specifically, for motion recovery, we introduce a wavelet-based self-attention module. It captures motion details from high-frequency features and extracts global trends from low-frequency components. To enhance the relations among different body parts, we design a structure-semantic fusion graph convolutional network. Moreover, we employ a dual-channel sliding window attention mechanism to capture motion periodicity, enabling smoother predictions. Extensive experiments on two benchmark datasets (Human3.6M, CMU-MoCap) demonstrate that IMPRESS achieves state-of-the-art average prediction performance under both complete and incomplete observations.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2393-2406"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid Granularity Distribution Estimation for Few-Shot Learning: Statistics Transfer From Categories and Instances
Pub Date: 2026-02-11 | DOI: 10.1109/TIP.2026.3661814
Shuo Wang;Tianyu Qi;Xingyu Zhu;Yanbin Hao;Beier Zhu;Hanwang Zhang;Meng Wang
Distribution estimation is a pivotal strategy in few-shot learning (FSL) to mitigate data scarcity by sampling from estimated distributions, utilizing statistical properties (mean and variance) transferred from related base categories. However, category-level estimation alone often fails to generate representative samples due to significant dissimilarities between base and novel categories, leading to suboptimal performance. To address this limitation, we propose Hybrid Granularity Distribution Estimation (HGDE), which integrates both coarse-grained category-level statistics and fine-grained instance-level statistics. By leveraging instance statistics from the nearest base samples, HGDE enhances the characterization of novel categories, capturing subtle features that category-level estimation overlooks. These statistics are fused through linear interpolation to form a robust distribution for novel categories, ensuring both diversity and representativeness in generated samples. Additionally, HGDE employs refined estimation techniques, such as weighted summation for mean calculation and principal component retention for covariance, to further improve accuracy. Empirical evaluations on four FSL benchmarks, including Mini-ImageNet, Tiered-ImageNet, CUB and CIFAR-FS, demonstrate that HGDE offers effective distribution estimation capabilities and leads to notable accuracy gains, with improvements of more than 1.8% in 1-shot tasks on CUB. These results highlight HGDE’s ability to balance mean precision and variance diversity, making it a versatile and effective solution for FSL.
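A compact sketch of the hybrid estimate: category-level and instance-level statistics are linearly interpolated, then extra features are sampled from the result. The interpolation weight and the diagonal-covariance simplification are assumptions for illustration.

```python
# Interpolate category- and instance-level statistics, then sample features.
import torch

def hybrid_sample(mu_cat, var_cat, mu_inst, var_inst, alpha=0.5, n=64):
    """All statistics are (D,) tensors. Returns (n, D) sampled features."""
    mu = alpha * mu_cat + (1 - alpha) * mu_inst      # fused mean
    var = alpha * var_cat + (1 - alpha) * var_inst   # fused (diagonal) variance
    return mu + var.sqrt() * torch.randn(n, mu.shape[0])

D = 128
feats = hybrid_sample(torch.zeros(D), torch.ones(D),
                      torch.full((D,), 0.5), torch.ones(D) * 0.8)
print(feats.shape)  # torch.Size([64, 128])
```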
{"title":"Hybrid Granularity Distribution Estimation for Few-Shot Learning: Statistics Transfer From Categories and Instances","authors":"Shuo Wang;Tianyu Qi;Xingyu Zhu;Yanbin Hao;Beier Zhu;Hanwang Zhang;Meng Wang","doi":"10.1109/TIP.2026.3661814","DOIUrl":"10.1109/TIP.2026.3661814","url":null,"abstract":"Distribution estimation is a pivotal strategy in few-shot learning (FSL) to mitigate data scarcity by sampling from estimated distributions, utilizing statistical properties (mean and variance) transferred from related base categories. However, category-level estimation alone often fails to generate representative samples due to significant dissimilarities between base and novel categories, leading to suboptimal performance. To address this limitation, we propose Hybrid Granularity Distribution Estimation (HGDE), which integrates both coarse-grained category-level statistics and fine-grained instance-level statistics. By leveraging instance statistics from the nearest base samples, HGDE enhances the characterization of novel categories, capturing subtle features that category-level estimation overlooks. These statistics are fused through linear interpolation to form a robust distribution for novel categories, ensuring both diversity and representativeness in generated samples. Additionally, HGDE employs refined estimation techniques, such as weighted summation for mean calculation and principal component retention for covariance, to further improve accuracy. Empirical evaluations on four FSL benchmarks, including Mini-ImageNet, Tiered-ImageNet, CUB and CIFAR-FS, demonstrate that HGDE offers effective distribution estimation capabilities and leads to notable accuracy gains, with improvements of more than 1.8% in 1-shot tasks on CUB. These results highlight HGDE’s ability to balance mean precision and variance diversity, making it a versatile and effective solution for FSL.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2080-2093"},"PeriodicalIF":13.7,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146161244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Fine-Grained Fusion Network for Multimodal UAV Object Detection
Pub Date: 2026-02-11 | DOI: 10.1109/TIP.2026.3661868
Zhanyan Tang;Zhihao Wu;Mu Li;Jie Wen;Bob Zhang;Yong Xu;Jianqiang Li
Multimodal perception and fusion play a vital role in uncrewed aerial vehicle (UAV) object detection. Existing methods typically adopt global fusion strategies across modalities. However, due to illumination variation, the effectiveness of RGB and infrared modalities may differ across local regions within the same image, particularly in UAV perspectives where occlusions and dense small objects are prevalent, leading to suboptimal performance of global fusion methods. To address this issue, we propose an adaptive fine-grained fusion network for multimodal UAV object detection. First, we design a local feature consistency-based modality fusion module, which adaptively assigns local fusion weights according to the structural consistency of high-response regions across modalities, thereby enabling more effective aggregation of object-relevant features. Second, we introduce a mutual information-guided feature contrastive loss to encourage the preservation of modality-specific information during the early training phase. Experimental results demonstrate that the proposed method effectively addresses the issue of object occlusion in UAV perspectives, achieving state-of-the-art performance on multimodal UAV object detection benchmarks. Code will be available at https://github.com/lingf5877/AFFNet.
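A toy sketch of locally adaptive fusion, assuming window-level weights derived from the cosine consistency of RGB and infrared activations; the window size and the weighting rule are illustrative, not the paper's module.

```python
# Per-window fusion weights from cross-modal feature consistency.
import torch
import torch.nn.functional as F

def local_fusion(rgb: torch.Tensor, ir: torch.Tensor, win: int = 8) -> torch.Tensor:
    """rgb, ir: (B, C, H, W) with H, W divisible by win."""
    pooled_r = F.avg_pool2d(rgb, win)            # (B, C, H/win, W/win)
    pooled_i = F.avg_pool2d(ir, win)
    consistency = F.cosine_similarity(pooled_r, pooled_i, dim=1, eps=1e-8)
    w = torch.sigmoid(consistency).unsqueeze(1)  # (B, 1, H/win, W/win)
    w = F.interpolate(w, scale_factor=win, mode="nearest")  # back to (H, W)
    return w * rgb + (1 - w) * ir                # region-wise convex fusion

out = local_fusion(torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64))
print(out.shape)  # torch.Size([1, 16, 64, 64])
```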
{"title":"Adaptive Fine-Grained Fusion Network for Multimodal UAV Object Detection","authors":"Zhanyan Tang;Zhihao Wu;Mu Li;Jie Wen;Bob Zhang;Yong Xu;Jianqiang Li","doi":"10.1109/TIP.2026.3661868","DOIUrl":"10.1109/TIP.2026.3661868","url":null,"abstract":"Multimodal perception and fusion play a vital role in uncrewed aerial vehicle (UAV) object detection. Existing methods typically adopt global fusion strategies across modalities. However, due to illumination variation, the effectiveness of RGB and infrared modalities may differ across local regions within the same image, particularly in UAV perspectives where occlusions and dense small objects are prevalent, leading to suboptimal performance of global fusion methods. To address this issue, we propose an adaptive fine-grained fusion network for multimodal UAV object detection. First, we design a local feature consistency-based modality fusion module, which adaptively assigns local fusion weights according to the structural consistency of high-response regions across modalities, thereby enabling more effective aggregation of object-relevant features. Second, we introduce a mutual information-guided feature contrastive loss to encourage the preservation of modality-specific information during the early training phase. Experimental results demonstrate that the proposed method effectively addresses the issue of object occlusion in UAV perspectives, achieving state-of-the-art performance on multimodal UAV object detection benchmarks. Code will be available at <uri>https://github.com/lingf5877/AFFNet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1870-1882"},"PeriodicalIF":13.7,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146161243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CGMNet: A Center-Pixel and Gated Mechanism-Based Attention Network for Hyperspectral Change Detection
Pub Date: 2026-02-11 | DOI: 10.1109/TIP.2026.3661851
Lanxin Wu;Jiangtao Peng;Bing Yang;Weiwei Sun;Mingzhu Huang
Change detection (CD) in hyperspectral images (HSIs) has become an increasingly vital research field in remote sensing. Over the past few years, the adoption of deep learning approaches, particularly convolutional neural network (CNN)- and transformer-based architectures, has significantly advanced performance in this field. While these models effectively capture spectral-spatial features, they may also introduce redundant or irrelevant spatial information, potentially degrading the accuracy of HSI CD. To address this challenge, a center-pixel and gated mechanism-based attention network (CGMNet) is proposed for HSI CD, leveraging the central pixel's significance to enhance accuracy and robustness. First, a gating-based center spatial attention (GCSA) module is designed to emphasize spatial relationships surrounding the central pixel. By incorporating gating mechanisms, GCSA selectively enhances relevant spatial features while suppressing irrelevant information. Second, a gating-based spectral attention (GSA) module is proposed to dynamically highlight the most significant spectral features, ensuring an effective spectral representation. Finally, a global transform fusion (GTF) module is proposed to capture global contextual information and to fuse it with the extracted spatial and spectral features. Moreover, we introduce a novel benchmark dataset, named the Hangzhou Bay (HZB) dataset, specifically designed to advance coastal remote sensing research. Experimental evaluations conducted on three publicly available datasets, as well as the HZB dataset, show that our CGMNet consistently outperforms several state-of-the-art methods in the HSI CD task. The source code of the proposed CGMNet, along with the HZB dataset, will be made publicly available at https://github.com/creativeXin/CGMNet.
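A simplified sketch of center-pixel-guided spatial gating: each position in a patch is weighted by its similarity to the central pixel. This stands in for, but does not reproduce, the GCSA module.

```python
# Gate spatial positions by their similarity to the patch's central pixel.
import torch

def center_gated(patch: torch.Tensor) -> torch.Tensor:
    """patch: (B, C, H, W) with odd H, W -> gated features, same shape."""
    B, C, H, W = patch.shape
    center = patch[:, :, H // 2, W // 2].view(B, C, 1, 1)
    sim = torch.nn.functional.cosine_similarity(patch, center, dim=1)  # (B, H, W)
    gate = torch.sigmoid(sim).unsqueeze(1)  # suppress pixels unlike the center
    return patch * gate

out = center_gated(torch.randn(2, 32, 9, 9))
print(out.shape)  # torch.Size([2, 32, 9, 9])
```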
{"title":"CGMNet: A Center-Pixel and Gated Mechanism-Based Attention Network for Hyperspectral Change Detection","authors":"Lanxin Wu;Jiangtao Peng;Bing Yang;Weiwei Sun;Mingzhu Huang","doi":"10.1109/TIP.2026.3661851","DOIUrl":"10.1109/TIP.2026.3661851","url":null,"abstract":"Change detection (CD) in hyperspectral images (HSIs) has become an increasingly vital research field in remote sensing. Over the past few years, the adoption of deep learning approaches, particularly convolutional neural network (CNN) and transformer-based architectures have significantly advanced performance in this field. While these models effectively capture spectral-spatial features, they may also introduce redundant or irrelevant spatial information, potentially degrading the accuracy of HSI CD. To address this challenge, a center-pixel and gated mechanism-based attention network (CGMNet) is proposed for HSI CD, leveraging the central pixel’s significance to enhance accuracy and robustness. First, a gated-based center spatial attention (GCSA) module is designed to emphasize spatial relationships surrounding the central pixel. By incorporating gating mechanisms, GCSA selectively enhances relevant spatial features while suppressing irrelevant information. Second, a gated-based spectral attention (GSA) module is proposed to dynamically highlight the most significant spectral features, ensuring an effective spectral representation. Finally, a global transform fusion (GTF) module is proposed to capture global contextual information and to fuse it with the extracted spatial and spectral features. Moreover, we introduce a novel benchmark dataset, named the Hangzhou Bay (HZB), specifically designed to advance coastal remote sensing research. Experimental evaluations conducted on three publicly available datasets, as well as the HZB dataset, show that our CGMNet consistently outperforms some state-of-the-art methods in the HSI CD task. The source code of the proposed CGMNet, along with the HZB dataset, will be made publicly available at <uri>https://github.com/creativeXin/CGMNet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1951-1965"},"PeriodicalIF":13.7,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146161245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}