A Few-Shot Class Incremental Learning Method Using Graph Neural Networks.
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657170
Yuqian Ma, Youfa Liu, Bo Du
Few-shot class incremental learning (FSCIL) aims to continuously learn new classes from limited training samples while retaining previously acquired knowledge. Existing approaches are not fully capable of balancing stability and plasticity in dynamic scenarios. To overcome this limitation, we introduce a novel FSCIL framework that leverages graph neural networks (GNNs) to model interdependencies between different categories and enhance cross-modal alignment. Our framework incorporates three key components: (1) a Graph Isomorphism Network (GIN) to propagate contextual relationships among prompts; (2) a Hamiltonian Graph Network with Energy Conservation (HGN-EC) to stabilize training dynamics via energy conservation constraints; and (3) an Adversarially Constrained Graph Autoencoder (ACGA) to enforce latent space consistency. By integrating these components with a parameter-efficient CLIP backbone, our method dynamically adapts graph structures to model semantic correlations between textual and visual modalities. Additionally, contrastive learning with energy-based regularization is employed to mitigate catastrophic forgetting and improve generalization. Comprehensive experiments on benchmark datasets show that the framework improves incremental accuracy and stability over state-of-the-art baselines. This work advances FSCIL by unifying graph-based relational reasoning with physics-inspired optimization, offering a scalable and interpretable framework.
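To make the graph-propagation idea concrete, here is a minimal PyTorch sketch of a GIN layer refining class-prompt embeddings over a semantic graph; the graph construction, feature width, and the `GINLayer` name are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One Graph Isomorphism Network layer: h_v' = MLP((1 + eps) * h_v + sum_u A[v, u] * h_u)."""
    def __init__(self, dim: int):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_classes, dim) prompt embeddings; adj: (num_classes, num_classes) adjacency.
        return self.mlp((1.0 + self.eps) * h + adj @ h)

# Toy usage: refine 10 class-prompt embeddings over a hypothetical semantic-similarity graph.
prompts = torch.randn(10, 512)
adj = (torch.rand(10, 10) > 0.7).float()
adj = ((adj + adj.t()) > 0).float().fill_diagonal_(0.0)
print(GINLayer(512)(prompts, adj).shape)  # torch.Size([10, 512])
```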
{"title":"A Few-Shot Class Incremental Learning Method Using Graph Neural Networks.","authors":"Yuqian Ma,Youfa Liu,Bo Du","doi":"10.1109/tip.2026.3657170","DOIUrl":"https://doi.org/10.1109/tip.2026.3657170","url":null,"abstract":"Few-shot class incremental learning (FSCIL) aims to continuously learn new classes from limited training samples while retaining previously acquired knowledge. Existing approaches are not fully capable of balancing stability and plasticity in dynamic scenarios. To overcome this limitation, we introduce a novel FSCIL framework that leverages graph neural networks (GNNs) to model interdependencies between different categories and enhance cross-modal alignment. Our framework incorporates three key components: (1) a Graph Isomorphism Network (GIN) to propagate contextual relationships among prompts; (2) a Hamiltonian Graph Network with Energy Conservation (HGN-EC) to stabilize training dynamics via energy conservation constraints; and (3) an Adversarially Constrained Graph Autoencoder (ACGA) to enforce latent space consistency. By integrating these components with a parameter-efficient CLIP backbone, our method dynamically adapts graph structures to model semantic correlations between textual and visual modalities. Additionally, contrastive learning with energy-based regularization is employed to mitigate catastrophic forgetting and improve generalization. Comprehensive experiments on benchmark datasets validate the framework's incremental accuracy and stability compared to state-of-the-art baselines. This work advances FSCIL by unifying graph-based relational reasoning with physics-inspired optimization, offering a scalable and interpretable framework.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"52 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BP-NeRF: End-to-End Neural Radiance Fields for Sparse Images without Camera Pose in Complex Scenes.
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657188
Yaru Qiu, Guoxia Wu, Yuanyuan Sun
Synthesizing high-quality novel views of complex scenes from sparse image sequences, especially when camera poses are unavailable, is a challenging task. The key to enhancing accuracy in such scenarios lies in sufficient prior knowledge and accurate camera motion constraints. Therefore, we propose an end-to-end novel view synthesis network named BP-NeRF. It uses sequences of sparse images captured in indoor and outdoor complex scenes to estimate camera motion trajectories and generate novel view images. Firstly, to address inaccurate depth map prediction caused by insufficient overlapping features in sparse images, we design the RDP-Net module to generate depth maps for sparse image sequences and calculate the depth accuracy of these maps, providing the network with a reliable depth prior. Secondly, to enhance the accuracy of camera pose estimation, we construct a loss function based on the geometric consistency of 2D and 3D feature variations between frames, improving the accuracy and robustness of the network's estimations. We conducted experimental evaluations on the LLFF and Tanks datasets, and the results show that, compared to current mainstream methods, BP-NeRF generates more accurate novel views without camera poses.
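As a rough illustration of an inter-frame geometric-consistency constraint (not necessarily BP-NeRF's exact loss), the sketch below back-projects matched keypoints with a predicted depth, transforms them with the estimated relative pose, re-projects them, and penalizes the 2D error; all names and shapes are assumptions.

```python
import torch

def reprojection_consistency_loss(kpts_i, kpts_j, depth_i, K, T_ij):
    """Back-project matched keypoints of frame i with predicted depth, move them to
    frame j with the estimated relative pose, re-project, and penalize the 2D error."""
    # kpts_i, kpts_j: (N, 2) matched pixels; depth_i: (N,); K: (3, 3); T_ij: (4, 4).
    ones = torch.ones(kpts_i.shape[0], 1)
    pix_h = torch.cat([kpts_i, ones], dim=1)                           # homogeneous pixels
    pts_i = (torch.linalg.inv(K) @ pix_h.t()).t() * depth_i[:, None]   # 3D points in camera i
    pts_j = (T_ij @ torch.cat([pts_i, ones], dim=1).t()).t()[:, :3]    # 3D points in camera j
    proj = (K @ pts_j.t()).t()
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                  # re-projected pixels
    return torch.mean(torch.norm(proj - kpts_j, dim=1))

# Sanity check: with an identity relative pose the loss reduces to the matching error (~0 here).
K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
kpts = torch.rand(8, 2) * 400
print(reprojection_consistency_loss(kpts, kpts, torch.full((8,), 2.0), K, torch.eye(4)))
```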
{"title":"BP-NeRF: End-to-End Neural Radiance Fields for Sparse Images without Camera Pose in Complex Scenes.","authors":"Yaru Qiu,Guoxia Wu,Yuanyuan Sun","doi":"10.1109/tip.2026.3657188","DOIUrl":"https://doi.org/10.1109/tip.2026.3657188","url":null,"abstract":"Synthesizing novel perspectives of complex scenes in high quality using sparse image sequences, especially for those without camera poses, is a challenging task. The key to enhancing accuracy in such scenarios lies in sufficient prior knowledge and accurate camera motion constraints. Therefore, we propose an end-to-end novel view synthesis network named BP-NeRF. It is capable of using sequences of sparse images captured in indoor and outdoor complex scenes to estimate camera motion trajectories and generate novel view images. Firstly, to address the issue of inaccurate prediction of depth map caused by insufficient overlapping features in sparse images, we designed the RDP-Net module to generate depth maps for sparse image sequences and calculate the depth accuracy of these maps, providing the network with a reliable depth prior. Secondly, to enhance the accuracy of camera pose estimation, we construct a loss function based on the geometric consistency of 2D and 3D feature variations between frames, improving the accuracy and robustness of the network's estimations. We conducted experimental evaluations on the LLFF and Tanks datasets, and the results show that, compared to the current mainstream methods, BP-NeRF can generate more accurate novel views without camera poses.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"31 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Domain-Adaptive Mamba for Cross-Scene Hyperspectral Image Classification.
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657209
Puhong Duan, Shiyu Jin, Xiaotian Lu, Lianhui Liang, Xudong Kang, Antonio Plaza
Cross-scene hyperspectral image classification aims to classify a new scene in the target domain by transferring knowledge learned from the source domain with limited training samples. Existing cross-scene alignment approaches focus on aligning the global feature distribution between the source and target domains while overlooking fine-grained alignment at different levels. Moreover, they mainly use Transformer architectures to model long-range dependencies across different channels but confront efficiency challenges due to their quadratic complexity, which limits classification performance in unsupervised domain adaptation tasks. To address these issues, a new domain-adaptive Mamba (DAMamba) is proposed for cross-scene hyperspectral image classification. First, a spectral-spatial Mamba is developed to extract high-order semantic features from the input data. Then, a domain-invariant prototype alignment method is proposed from three perspectives, i.e., intra-domain, inter-domain, and mini-batch, to produce reliable pseudo-labels and mitigate the spectral shift between the source and target domains. Finally, a fully connected layer is applied to the aligned features in the target domain to obtain the final classification results. Extensive evaluations across diverse cross-scene datasets demonstrate that our DAMamba outperforms existing state-of-the-art methods in both classification accuracy and computational efficiency. The code of this paper is available at https://github.com/PuhongDuan/DAMamba.
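A minimal sketch of domain-invariant prototype alignment with classifier-derived pseudo-labels is given below; it only illustrates the general idea of pulling per-class prototypes of the two domains together, not the paper's intra-domain/inter-domain/mini-batch scheme, and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(src_feat, src_lbl, tgt_feat, tgt_logits, num_classes):
    """Pull per-class prototypes of the source and target domains together, using the
    classifier's predictions on the target domain as pseudo-labels."""
    tgt_pseudo = tgt_logits.argmax(dim=1)
    loss, count = 0.0, 0
    for c in range(num_classes):
        s_mask, t_mask = src_lbl == c, tgt_pseudo == c
        if s_mask.any() and t_mask.any():
            p_s = F.normalize(src_feat[s_mask].mean(dim=0), dim=0)
            p_t = F.normalize(tgt_feat[t_mask].mean(dim=0), dim=0)
            loss = loss + (1.0 - torch.dot(p_s, p_t))   # cosine distance between prototypes
            count += 1
    return loss / max(count, 1)

# Toy usage with 64-dim features and 5 classes.
loss = prototype_alignment_loss(torch.randn(32, 64), torch.randint(0, 5, (32,)),
                                torch.randn(32, 64), torch.randn(32, 5), num_classes=5)
print(float(loss))
```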
{"title":"Domain-Adaptive Mamba for Cross-Scene Hyperspectral Image Classification.","authors":"Puhong Duan,Shiyu Jin,Xiaotian Lu,Lianhui Liang,Xudong Kang,Antonio Plaza","doi":"10.1109/tip.2026.3657209","DOIUrl":"https://doi.org/10.1109/tip.2026.3657209","url":null,"abstract":"Cross-scene hyperspectral image classification aims to identify a new scene in target domain via learned knowledge from source domain using limited training samples. Existing cross-scene alignment approaches focus on aligning the global feature distribution between the source and target domains while overlooking the fine-grained alignment at different levels. Moreover, they mainly use Transformer architectures to model long-range dependencies across different channels but confront efficiency challenges due to their quadratic complexity, which limits classification performance in unsupervised domain adaptation tasks. To address these issues, a new domain-adaptive Mamba (DAMamba) is proposed for cross-scene hyperspectral image classification. First, a spectral-spatial Mamba is developed to extract high-order semantic features from the input data. Then, a domain-invariant prototype alignment method is proposed from three perspectives, i.e., intra-domain, inter-domain, and mini-batch, to produce reliable pseudo-labels and mitigate the spectral shift between the source and target domains. Finally, a fully connected layer is applied to the aligned features in the target domain to obtain the final classification results. Extensive evaluations across diverse cross-scene datasets demonstrate that our DAMamba outperforms existing state-of-the-art methods in classification accuracy and computing time. The code of this paper is available at https://github.com/PuhongDuan/DAMamba.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"473 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Equivariant High-Resolution Hyperspectral Imaging via Mosaiced and PAN Image Fusion.
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657219
Nan Wang, Anjing Guo, Renwei Dian, Shutao Li
Existing mosaic-based snapshot hyperspectral imaging systems struggle to capture high resolution (HR) hyperspectral image (HSI), limiting its application. Fusing a low resolution (LR) mosaiced image with an HR panchromatic (PAN) image serves as a feasible solution to obtain the HR HSI. Therefore, we propose a dual-sensor based HSI imaging system, combining a 4×4 spectral filter array (SFA) mosaiced image sensor with a co-aligned PAN image sensor to provide complementary spatial-spectral information. To reconstruct HR HSI, we propose an unsupervised equivariant imaging (EI)-based training framework with a learnable degradation function, overcoming the inaccessibility of ground truth and spectral response function (SRF). Specifically, we formulate the degradation process as a combination of 8×8 mosaicing and 2×2 average downsampling for the LR mosaiced image, while modeling the PAN image as a linear projection of the HR HSI using SRF. Since parameters of SRF are inaccessible, we propose to make them learnable to have an accurate estimation. By enforcing transformation equivariance between the input-output pair of the fusion network, the proposed framework ensures the reconstructed HSI preserves spatial-spectral consistency without relying on paired supervision. Furthermore, we instantiate the proposed HSI imaging system and collect a real-world dataset of 60 paired mosaiced / PAN images. The mosaiced image exhibits 16 spectral bands ranging from 722 to 896 nm and 1020×1104 spatial pixels while the PAN image exhibits 2040×2208 spatial pixels. Comprehensive experiments demonstrate that the proposed method exhibits high spatial consistency and spectral fidelity while maintaining computational efficiency.
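The two degradation paths described above (downsampling plus SFA mosaicing for the mosaiced branch, and a learnable SRF projection for the PAN branch) can be sketched roughly as follows; the SFA layout, the ordering of the operations, the `LearnableDegradation` name, and the softmax SRF parameterization are placeholder assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableDegradation(nn.Module):
    """Two degradation paths for an HR HSI of shape (B, 16, H, W): (i) 2x2 average
    downsampling followed by a periodic 4x4 SFA mosaic, and (ii) a PAN image obtained
    as a learnable SRF-weighted sum over bands."""
    def __init__(self, bands: int = 16):
        super().__init__()
        self.srf_logits = nn.Parameter(torch.zeros(bands))               # learnable SRF weights
        self.register_buffer("sfa", torch.arange(bands).reshape(4, 4))   # hypothetical band layout

    def forward(self, hsi: torch.Tensor):
        b, c, h, w = hsi.shape
        srf = F.softmax(self.srf_logits, dim=0).view(1, c, 1, 1)
        pan = (srf * hsi).sum(dim=1, keepdim=True)                       # (B, 1, H, W)
        lr = F.avg_pool2d(hsi, kernel_size=2)                            # (B, 16, H/2, W/2)
        rows = torch.arange(h // 2, device=hsi.device)
        cols = torch.arange(w // 2, device=hsi.device)
        band_map = self.sfa[rows[:, None] % 4, cols[None, :] % 4]        # band index per pixel
        mosaic = torch.gather(lr, 1, band_map.expand(b, 1, -1, -1)).squeeze(1)
        return mosaic, pan

x = torch.rand(1, 16, 64, 64)
mosaic, pan = LearnableDegradation()(x)
print(mosaic.shape, pan.shape)  # torch.Size([1, 32, 32]) torch.Size([1, 1, 64, 64])
```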
{"title":"Equivariant High-Resolution Hyperspectral Imaging via Mosaiced and PAN Image Fusion.","authors":"Nan Wang,Anjing Guo,Renwei Dian,Shutao Li","doi":"10.1109/tip.2026.3657219","DOIUrl":"https://doi.org/10.1109/tip.2026.3657219","url":null,"abstract":"Existing mosaic-based snapshot hyperspectral imaging systems struggle to capture high resolution (HR) hyperspectral image (HSI), limiting its application. Fusing a low resolution (LR) mosaiced image with an HR panchromatic (PAN) image serves as a feasible solution to obtain the HR HSI. Therefore, we propose a dual-sensor based HSI imaging system, combining a 4×4 spectral filter array (SFA) mosaiced image sensor with a co-aligned PAN image sensor to provide complementary spatial-spectral information. To reconstruct HR HSI, we propose an unsupervised equivariant imaging (EI)-based training framework with a learnable degradation function, overcoming the inaccessibility of ground truth and spectral response function (SRF). Specifically, we formulate the degradation process as a combination of 8×8 mosaicing and 2×2 average downsampling for the LR mosaiced image, while modeling the PAN image as a linear projection of the HR HSI using SRF. Since parameters of SRF are inaccessible, we propose to make them learnable to have an accurate estimation. By enforcing transformation equivariance between the input-output pair of the fusion network, the proposed framework ensures the reconstructed HSI preserves spatial-spectral consistency without relying on paired supervision. Furthermore, we instantiate the proposed HSI imaging system and collect a real-world dataset of 60 paired mosaiced / PAN images. The mosaiced image exhibits 16 spectral bands ranging from 722 to 896 nm and 1020×1104 spatial pixels while the PAN image exhibits 2040×2208 spatial pixels. Comprehensive experiments demonstrate that the proposed method exhibits high spatial consistency and spectral fidelity while maintaining computational efficiency.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"42 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Individual & Common Attack: Enhancing Transferability in VLP Models through Modal Feature Exploitation.
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3651982
Yaguan Qian, Yaxin Kong, Qiqi Bao, Zhaoquan Gu, Bin Wang, Shouling Ji, Jianping Zhang, Zhen Lei
Vision-Language Pretrained (VLP) models exhibit strong multimodal understanding and reasoning capabilities, finding wide application in tasks such as image-text retrieval and visual grounding. However, they remain highly vulnerable to adversarial attacks, posing serious reliability concerns in safety-critical scenarios. We observe that existing adversarial example optimization methods typically rely on individual features from the other modality as guidance, causing the crafted adversarial examples to overfit that modality's learning preferences and thus limiting their transferability. To further enhance the transferability of adversarial examples, we propose a novel adversarial attack framework, I&CA (Individual & Common feature Attack), which simultaneously considers individual features within each modality and common features arising from cross-modal interactions. Concretely, I&CA first drives divergence among individual features within each modality to disrupt single-modality learning, and then suppresses the expression of common features during cross-modal interactions, thereby undermining the robustness of the fusion mechanism. In addition, to prevent adversarial perturbations from overfitting to the learning bias of the other modality, which may distort the representation of common features, we introduce augmentation strategies to both modalities simultaneously. Across various experimental settings and widely recognized multimodal benchmarks, the I&CA framework achieves an average transferability improvement of 6.15% over the state-of-the-art DRA method, delivering significant performance gains in both cross-model and cross-task attack scenarios.
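To make the feature-level attack idea concrete, here is a generic feature-space PGD sketch that pushes the adversarial image embedding away from the clean image embedding (a per-modality term) and away from the paired text embedding (a cross-modal term). The loss form, step sizes, and function names are illustrative assumptions, not the I&CA objective; `img_encoder` stands for any differentiable image encoder and `txt_feat` for the paired text embedding.

```python
import torch
import torch.nn.functional as F

def feature_space_pgd(image, img_encoder, txt_feat, eps=8 / 255, alpha=2 / 255, steps=10):
    """Gradient-ascent PGD on a feature-space objective; assumes pixel values in [0, 1]."""
    clean_feat = F.normalize(img_encoder(image).detach(), dim=-1)
    txt_feat = F.normalize(txt_feat.detach(), dim=-1)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_feat = F.normalize(img_encoder(image + delta), dim=-1)
        # Lower similarity to the clean image features and to the paired text features.
        loss = -F.cosine_similarity(adv_feat, clean_feat, dim=-1).mean() \
               - F.cosine_similarity(adv_feat, txt_feat, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend: maximize the dissimilarity objective
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)
```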
{"title":"Individual & Common Attack: Enhancing Transferability in VLP Models through Modal Feature Exploitation.","authors":"Yaguan Qian,Yaxin Kong,Qiqi Bao,Zhaoquan Gu,Bin Wang,Shouling Ji,Jianping Zhang,Zhen Lei","doi":"10.1109/tip.2026.3651982","DOIUrl":"https://doi.org/10.1109/tip.2026.3651982","url":null,"abstract":"Vision-Language Pretrained (VLP) models exhibit strong multimodal understanding and reasoning capabilities, finding wide application in tasks such as image-text retrieval and visual grounding. However, they remain highly vulnerable to adversarial attacks, posing serious reliability concerns in safety-critical scenarios. We observe that existing adversarial examples optimization methods typically rely on individual features from the other modality as guidance, causing the crafted adversarial examples to overfit that modality's learning preferences and thus limiting their transferability. In order to further enhance the transferability of adversarial examples, we propose a novel adversarial attack framework, I&CA (Individual & Common feature Attack), which simultaneously considers individual features within each modality and common features cross-modal interactions. Concretely, I&CA first drives divergence among individual features within each modality to disrupt single-modality learning, and then suppresses the expression of common features during cross-modal interactions, thereby undermining the robustness of the fusion mechanism. In addition, to prevent adversarial perturbations from overfitting to the learning bias of the other modality, which may distort the representation of common features, we simultaneously introduce augmentation strategies to both modalities. Across various experimental settings and widely recognized multimodal benchmarks, the I&CA framework achieves an average transferability improvement of 6.15% over the state-of-the-art DRA method, delivering significant performance gains in both cross-model and cross-task attack scenarios.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"183 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rethinking Multi-Focus Image Fusion: An Input Space Optimisation View.
Pub Date: 2026-01-23 | DOI: 10.1109/tip.2026.3654370
Zeyu Wang, Shuang Yu, Haoran Duan, Shidong Wang, Yang Long, Ling Shao
Multi-focus image fusion (MFIF) addresses the challenge of partial focus by integrating multiple source images taken at different focal depths. Unlike most existing methods that rely on complex loss functions or large-scale synthetic datasets, this study approaches MFIF from a novel perspective: optimizing the input space. The core idea is to construct a high-quality MFIF input space in a cost-effective manner by using intermediate features from well-trained, non-MFIF networks. To this end, we propose a cascaded framework comprising two feature extractors, a Feature Distillation and Fusion Module (FDFM), and a focus segmentation network, YUNet. Based on our observation that discrepancy and edge features are essential for MFIF, we select an image deblurring network and a salient object detection network as feature extractors. To transform these extracted features into an MFIF-suitable input space, we propose FDFM as a training-free feature adapter. To make FDFM compatible with high-dimensional feature maps, we extend the manifold theory from the edge-preserving field and design a novel isometric domain transformation. Extensive experiments on six benchmark datasets show that (i) our model consistently outperforms 13 state-of-the-art methods in both qualitative and quantitative evaluations, and (ii) the constructed input space can directly enhance the performance of many MFIF models without additional requirements.
U-RWKV: Accurate and Efficient Volumetric Medical Image Segmentation via RWKV.
Pub Date: 2026-01-23 | DOI: 10.1109/tip.2026.3654389
Hongyu Cai, Yifan Wang, Liu Wang, Jian Zhao, Zhejun Kuang
Accurate and efficient volumetric medical image segmentation is vital for clinical diagnosis, pre-operative planning, and disease-progression monitoring. Conventional convolutional neural networks (CNNs) struggle to capture long-range contextual information, whereas Transformer-based methods suffer from quadratic computational complexity, making it challenging to couple global modeling with high efficiency. To address these limitations, we explore an efficient yet accurate segmentation model for volumetric data. Specifically, we introduce a novel linear-complexity sequence modeling technique, RWKV, and leverage it to design a Tri-directional Spatial Enhancement RWKV (TSE-R) block; this module performs global modeling via RWKV and incorporates two optimizations tailored to three-dimensional data: (1) a spatial-shift strategy that enlarges the local receptive field and facilitates inter-block interaction, thereby alleviating the structural information loss caused by sequence serialization; and (2) a tri-directional scanning mechanism that constructs sequences along three distinct directions, applies global modeling via WKV, and fuses them with learnable weights to preserve the inherent 3D spatial structure. Building upon the TSE-R block, we develop an end-to-end 3D segmentation network, termed U-RWKV. Extensive experiments on three public 3D medical segmentation benchmarks demonstrate that U-RWKV outperforms state-of-the-art CNN-, Transformer-, and Mamba-based counterparts, achieving a Dice score of 87.21% on the Synapse multi-organ abdominal dataset while reducing the parameter count by a factor of 16.08 compared with leading methods.
Knowledge-Prompted Trustworthy Disentangled Learning for Thyroid Ultrasound Segmentation with Limited Annotations.
Pub Date: 2026-01-23 | DOI: 10.1109/tip.2026.3654413
Wenxu Wang, Weizhen Wang, Qianjin Feng, Yu Zhang, Zhenyuan Ning
The similar textures, diverse shapes, and blurred boundaries of thyroid lesions in ultrasound images pose a significant challenge to accurate segmentation. Although several methods have been proposed to alleviate these issues, their generalization is hindered by limited annotated data and an insufficient ability to distinguish lesions from their surrounding tissues, especially in the presence of noise and outliers. Additionally, most existing methods lack uncertainty estimation, which is essential for providing trustworthy results and identifying potential mispredictions. To this end, we propose knowledge-prompted trustworthy disentangled learning (KPTD) for thyroid ultrasound segmentation with limited annotations. The proposed method consists of three key components: 1) Knowledge-aware prompt learning (KAPL) encodes TI-RADS reports into text features and introduces learnable prompts to extract contextual embeddings, which assist in generating region activation maps (serving as pseudo-labels for unlabeled images). 2) Foreground-background disentangled learning (FBDL) leverages region activation maps to disentangle foreground and background representations, refining their prototype distributions through a contrastive learning strategy to enhance the model's discrimination and robustness. 3) Foreground-background trustworthy fusion (FBTF) integrates the foreground and background representations and estimates their uncertainty based on evidence theory, providing trustworthy segmentation results. Experimental results show that KPTD achieves superior segmentation performance under limited annotations, significantly outperforming state-of-the-art methods.
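For context on the evidence-theoretic uncertainty mentioned above, here is the common subjective-logic recipe used in evidential segmentation (evidence, Dirichlet parameters, belief mass, and per-pixel uncertainty u = K/S); it illustrates the generic formulation, not necessarily the exact FBTF fusion rule, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits: torch.Tensor):
    """Map segmentation logits to Dirichlet evidence, belief mass, expected class
    probabilities, and a per-pixel uncertainty u = K / S (in (0, 1])."""
    # logits: (B, K, H, W) with K classes (e.g., foreground / background).
    evidence = F.softplus(logits)                # non-negative evidence
    alpha = evidence + 1.0                       # Dirichlet parameters
    strength = alpha.sum(dim=1, keepdim=True)    # S = sum_k alpha_k
    belief = evidence / strength                 # per-class belief mass
    uncertainty = logits.shape[1] / strength     # u = K / S
    prob = alpha / strength                      # expected class probabilities
    return prob, belief, uncertainty

prob, belief, u = evidential_uncertainty(torch.randn(1, 2, 64, 64))
print(prob.shape, u.shape)  # torch.Size([1, 2, 64, 64]) torch.Size([1, 1, 64, 64])
```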
{"title":"Knowledge-Prompted Trustworthy Disentangled Learning for Thyroid Ultrasound Segmentation with Limited Annotations.","authors":"Wenxu Wang,Weizhen Wang,Qianjin Feng,Yu Zhang,Zhenyuan Ning","doi":"10.1109/tip.2026.3654413","DOIUrl":"https://doi.org/10.1109/tip.2026.3654413","url":null,"abstract":"The similar textures, diverse shapes and blurred boundaries of thyroid lesions in ultrasound images pose a significant challenge to accurate segmentation. Although several methods have been proposed to alleviate the aforementioned issues, their generalization is hindered by limited annotation data and insufficient ability to distinguish lesion from its surrounding tissues, especially in the presence of noise and outlier. Additionally, most existing methods lack uncertainty estimation which is essential for providing trustworthy results and identifying potential mispredictions. To this end, we propose knowledge-prompted trustworthy disentangled learning (KPTD) for thyroid ultrasound segmentation with limited annotations. The proposed method consists of three key components: 1) Knowledge-aware prompt learning (KAPL) encodes TI-RADS reports into text features and introduces learnable prompts to extract contextual embeddings, which assist in generating region activation maps (serving as pseudo-labels for unlabeled images). 2) Foreground-background disentangled learning (FBDL) leverages region activation maps to disentangle foreground and background representations, refining their prototype distributions through a contrastive learning strategy to enhance the model's discrimination and robustness. 3) Foreground-background trustworthy fusion (FBTF) integrates the foreground and background representations and estimates their uncertainty based on evidence theory, providing trustworthy segmentation results. Experimental results show that KPTD achieves superior segmentation performance under limited annotations, significantly outperforming state-of-the-art methods.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"66 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146034076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topology-Guided Semantic Face Center Estimation for Rotation-Invariant Face Detection.
Pub Date: 2026-01-23 | DOI: 10.1109/tip.2026.3654422
Hathai Kaewkorn, Lifang Zhou, Weisheng Li, Chengjiang Long
Face detection accuracy decreases significantly under rotational variations, including in-plane (RIP) and out-of-plane (ROP) rotations. ROP is particularly problematic due to its impact on landmark distortion, which leads to inaccurate face center localization. Meanwhile, many existing rotation-invariant models are primarily designed to handle RIP; they often fail under ROP because they lack the ability to capture semantic and topological relationships. Moreover, existing datasets frequently suffer from unreliable landmark annotations caused by imperfect ground-truth labeling, the absence of precise center annotations, and imbalanced data across different rotation angles. To address these challenges, we propose a topology-guided semantic face center estimation method that leverages graph-based landmark relationships to preserve structural integrity under both RIP and ROP. Additionally, we construct a rotation-aware face dataset with accurate face center annotations and balanced rotational diversity to support training under extreme pose conditions. Next, we introduce a Hybrid-ViT model that fuses CNN spatial features with transformer-based global context, and we employ a center-guided module for robust landmark localization under extreme rotations. To evaluate center quality, we further design a hybrid metric that combines topological geometry with semantic perception for a more comprehensive evaluation of face center accuracy. Finally, experimental results demonstrate that our method outperforms state-of-the-art models in cross-dataset evaluations. Code: https://github.com/Catster111/TCE_RIFD.
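To illustrate what a hybrid geometric-plus-semantic center score could look like, the sketch below blends a center error normalized by the landmark bounding-box diagonal with a cosine similarity of features sampled at the predicted and ground-truth centers; the weighting, normalization, and function name are hypothetical, not the paper's metric.

```python
import torch
import torch.nn.functional as F

def hybrid_center_score(pred_center, gt_center, landmarks, feat_pred, feat_gt, w=0.5):
    """Blend a geometric term (center error normalized by the landmark bounding-box
    diagonal) with a semantic term (cosine similarity of center features); higher is better."""
    # pred_center, gt_center: (2,); landmarks: (N, 2); feat_pred, feat_gt: (D,)
    diag = (landmarks.max(dim=0).values - landmarks.min(dim=0).values).norm().clamp(min=1e-6)
    geometric = 1.0 - (pred_center - gt_center).norm() / diag
    semantic = F.cosine_similarity(feat_pred, feat_gt, dim=0)
    return w * geometric.clamp(0, 1) + (1.0 - w) * semantic.clamp(0, 1)

landmarks = torch.rand(68, 2) * 100
score = hybrid_center_score(landmarks.mean(0), landmarks.mean(0) + 2.0,
                            landmarks, torch.rand(64), torch.rand(64))
print(float(score))
```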
{"title":"Topology-Guided Semantic Face Center Estimation for Rotation-Invariant Face Detection.","authors":"Hathai Kaewkorn,Lifang Zhou,Weisheng Li,Chengjiang Long","doi":"10.1109/tip.2026.3654422","DOIUrl":"https://doi.org/10.1109/tip.2026.3654422","url":null,"abstract":"Face detection accuracy significantly decreases under rotational variations, including in-plane (RIP) and out-of-plane (ROP) rotations. ROP is particularly problematic due to its impact on landmark distortion, which leads to inaccurate face center localization. Meanwhile, many existing rotation-invariant models are primarily designed to handle RIP, they often fail under ROP because they lack the ability to capture semantic and topological relationships. Moreover, existing datasets frequently suffer from unreliable landmark annotations caused by imperfect ground truth labeling, the absence of precise center annotations, and imbalanced data across different rotation angles. To address these challenges, we propose a topology-guided semantic face center estimation method that leverages graph-based landmark relationships to preserve structural integrity under both RIP and ROP. Additionally, we construct a rotation-aware face dataset with accurate face center annotations and balanced rotational diversity to support training under extreme pose conditions. Next, we introduce a Hybrid-ViT model that fuses CNN spatial features with transformer-based global context and employ a center-guided module for robust landmark localization under extreme rotations. In order to evaluate center quality, we further design a hybrid metric that combines topological geometry with semantic perception for a more comprehensive evaluation of face center accuracy. Finally, experimental results demonstrate that our method outperforms state-of-the-art models in cross-dataset evaluations. Code: https://github.com/Catster111/TCE_RIFD.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"2 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146034070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}