Accurate and efficient volumetric medical image segmentation is vital for clinical diagnosis, pre-operative planning, and disease-progression monitoring. Conventional convolutional neural networks (CNNs) struggle to capture long-range contextual information, whereas Transformer-based methods suffer from quadratic computational complexity, making it challenging to couple global modeling with high efficiency. To address these limitations, we explore an efficient yet accurate segmentation model for volumetric data. Specifically, we introduce a novel linear-complexity sequence modeling technique, RWKV, and leverage it to design a Tri-directional Spatial Enhancement RWKV (TSE-R) block; this module performs global modeling via RWKV and incorporates two optimizations tailored to three-dimensional data: (1) a spatial-shift strategy that enlarges the local receptive field and facilitates inter-block interaction, thereby alleviating the structural information loss caused by sequence serialization; and (2) a tri-directional scanning mechanism that constructs sequences along three distinct directions, applies global modeling via WKV, and fuses them with learnable weights to preserve the inherent 3D spatial structure. Building upon the TSE-R block, we develop an end-to-end 3D segmentation network, termed U-RWKV, and extensive experiments on three public 3D medical segmentation benchmarks demonstrate that U-RWKV outperforms state-of-the-art CNN-, Transformer-, and Mamba-based counterparts, achieving a Dice score of 87.21% on the Synapse multi-organ abdominal dataset while reducing parameter count by a factor of 16.08 compared with leading methods.
Title: U-RWKV: Accurate and Efficient Volumetric Medical Image Segmentation via RWKV. Authors: Hongyu Cai, Yifan Wang, Liu Wang, Jian Zhao, Zhejun Kuang. IEEE Transactions on Image Processing, 2026-01-23, DOI: 10.1109/tip.2026.3654389.
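The tri-directional scanning and learnable-weight fusion described in the U-RWKV abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are made up, and a causal cumulative mean stands in for the WKV operator, which is far more expressive in practice. The sketch only shows the structural idea of serializing a volume along three axis orders, modeling each sequence globally, and fusing the de-serialized results with softmax-normalized weights.

```python
import numpy as np

def causal_cummean(seq):
    """Stand-in for a linear-complexity global operator (e.g. WKV):
    each position sees the running mean of itself and earlier positions."""
    csum = np.cumsum(seq, axis=0)
    counts = np.arange(1, seq.shape[0] + 1)[:, None]
    return csum / counts

def tri_directional_fuse(vol, weights):
    """vol: (D, H, W) feature volume; weights: 3 fusion logits
    (learnable parameters in a real network)."""
    orders = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]  # three scan directions
    outs = []
    for perm in orders:
        seq = vol.transpose(perm).reshape(-1, 1)   # serialize along one direction
        out = causal_cummean(seq)                  # global modeling per direction
        inv = np.argsort(perm)                     # undo flatten + permutation
        outs.append(out.reshape(np.array(vol.shape)[list(perm)]).transpose(inv))
    w = np.exp(weights) / np.exp(weights).sum()    # softmax fusion weights
    return sum(wi * oi for wi, oi in zip(w, outs))

vol = np.arange(24, dtype=float).reshape(2, 3, 4)
fused = tri_directional_fuse(vol, np.zeros(3))
print(fused.shape)  # (2, 3, 4)
```

With zero logits the three directional outputs are averaged uniformly; training would instead learn which scan direction to emphasize.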
Pub Date: 2026-01-23, DOI: 10.1109/tip.2026.3654413
Wenxu Wang, Weizhen Wang, Qianjin Feng, Yu Zhang, Zhenyuan Ning
The similar textures, diverse shapes, and blurred boundaries of thyroid lesions in ultrasound images pose a significant challenge to accurate segmentation. Although several methods have been proposed to alleviate these issues, their generalization is hindered by limited annotated data and an insufficient ability to distinguish lesions from their surrounding tissues, especially in the presence of noise and outliers. Additionally, most existing methods lack uncertainty estimation, which is essential for providing trustworthy results and identifying potential mispredictions. To this end, we propose knowledge-prompted trustworthy disentangled learning (KPTD) for thyroid ultrasound segmentation with limited annotations. The proposed method consists of three key components: 1) Knowledge-aware prompt learning (KAPL) encodes TI-RADS reports into text features and introduces learnable prompts to extract contextual embeddings, which assist in generating region activation maps (serving as pseudo-labels for unlabeled images). 2) Foreground-background disentangled learning (FBDL) leverages region activation maps to disentangle foreground and background representations, refining their prototype distributions through a contrastive learning strategy to enhance the model's discrimination and robustness. 3) Foreground-background trustworthy fusion (FBTF) integrates the foreground and background representations and estimates their uncertainty based on evidence theory, providing trustworthy segmentation results. Experimental results show that KPTD achieves superior segmentation performance under limited annotations, significantly outperforming state-of-the-art methods.
Title: Knowledge-Prompted Trustworthy Disentangled Learning for Thyroid Ultrasound Segmentation with Limited Annotations. IEEE Transactions on Image Processing.
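The evidence-theoretic uncertainty in the FBTF component is not spelled out in the abstract, but evidential segmentation methods commonly map non-negative per-class evidence to subjective-logic belief masses plus an explicit uncertainty mass u = K/S. The sketch below assumes that standard formulation; the function name and example values are illustrative only.

```python
import numpy as np

def evidential_uncertainty(evidence):
    """Subjective-logic beliefs from non-negative per-class evidence.

    evidence: (..., K) array. Returns (belief, uncertainty) with
    belief_k = e_k / S and u = K / S, where S = sum(e) + K, so that
    the beliefs and the uncertainty mass sum to 1 for each sample.
    """
    K = evidence.shape[-1]
    S = evidence.sum(axis=-1, keepdims=True) + K   # Dirichlet strength
    belief = evidence / S
    u = K / S.squeeze(-1)
    return belief, u

# two pixels with foreground/background evidence
ev = np.array([[9.0, 1.0],    # strong evidence -> low uncertainty
               [0.0, 0.0]])   # no evidence -> maximal uncertainty
belief, u = evidential_uncertainty(ev)
print(u)  # u is about [0.167, 1.0]
```

Pixels with near-maximal u can then be flagged as potential mispredictions rather than silently reported as confident foreground or background.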
Pub Date: 2026-01-23, DOI: 10.1109/tip.2026.3654422
Hathai Kaewkorn, Lifang Zhou, Weisheng Li, Chengjiang Long
Face detection accuracy significantly decreases under rotational variations, including in-plane (RIP) and out-of-plane (ROP) rotations. ROP is particularly problematic due to its impact on landmark distortion, which leads to inaccurate face center localization. Meanwhile, many existing rotation-invariant models are primarily designed to handle RIP; they often fail under ROP because they lack the ability to capture semantic and topological relationships. Moreover, existing datasets frequently suffer from unreliable landmark annotations caused by imperfect ground-truth labeling, the absence of precise center annotations, and imbalanced data across different rotation angles. To address these challenges, we propose a topology-guided semantic face center estimation method that leverages graph-based landmark relationships to preserve structural integrity under both RIP and ROP. Additionally, we construct a rotation-aware face dataset with accurate face center annotations and balanced rotational diversity to support training under extreme pose conditions. Next, we introduce a Hybrid-ViT model that fuses CNN spatial features with transformer-based global context and employ a center-guided module for robust landmark localization under extreme rotations. To evaluate center quality, we further design a hybrid metric that combines topological geometry with semantic perception for a more comprehensive evaluation of face center accuracy. Finally, experimental results demonstrate that our method outperforms state-of-the-art models in cross-dataset evaluations. Code: https://github.com/Catster111/TCE_RIFD.
Title: Topology-Guided Semantic Face Center Estimation for Rotation-Invariant Face Detection. IEEE Transactions on Image Processing.
Pub Date: 2026-01-23, DOI: 10.1109/tip.2026.3655117
Pengfei Chen, Jiebin Yan, Rajiv Soundararajan, Giuseppe Valenzise, Cai Li, Leida Li
Video Quality Assessment (VQA) strives to computationally emulate human perceptual judgments and has garnered significant attention given its widespread applicability. However, existing methodologies face two primary impediments: (1) limited proficiency in evaluating samples at quality extremes (e.g., severely degraded or near-perfect videos), and (2) insufficient sensitivity to nuanced quality variations arising from a misalignment with human perceptual mechanisms. Although vision-language models offer promising semantic understanding, their reliance on visual encoders pre-trained for high-level tasks often compromises their sensitivity to low-level distortions. To surmount these challenges, we propose the Restoration-Assisted Multi-modality VQA (RAM-VQA) framework. Uniquely, our approach leverages video restoration as a proxy to explicitly model distortion-sensitive features. The framework operates through two synergistic stages: a prompt learning stage that constructs a quality-aware textual space using triple-level references (degraded, restored, and pristine) derived from the restoration process, and a dual-branch evaluation stage that integrates semantic cues with technical quality indicators via spatio-temporal differential analysis. Extensive experiments demonstrate that RAM-VQA achieves state-of-the-art performance across diverse benchmarks, exhibiting superior capability in handling extreme-quality content while ensuring robust generalization.
Title: RAM-VQA: Restoration Assisted Multi-modality Video Quality Assessment. IEEE Transactions on Image Processing.
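The abstract's "spatio-temporal differential analysis" is not defined further; one common low-level technical cue of that kind is the mean absolute difference between consecutive frames, which reacts to flicker and other temporal distortions. The sketch below is only a guess at the flavor of such a signal, not RAM-VQA's actual branch.

```python
import numpy as np

def temporal_diff_features(frames):
    """frames: (T, H, W) grayscale clip. Returns one scalar per frame
    transition: the mean absolute temporal difference, a simple
    distortion-sensitive technical cue."""
    diffs = np.abs(np.diff(frames, axis=0))   # (T-1, H, W)
    return diffs.mean(axis=(1, 2))

# toy clip whose brightness ramps 0 -> 1 -> 2
clip = np.stack([np.full((4, 4), t, dtype=float) for t in range(3)])
print(temporal_diff_features(clip))  # [1. 1.]
```

A real quality model would feed such per-step statistics, alongside semantic features, into a learned regressor rather than using them directly.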
Pub Date: 2026-01-23, DOI: 10.1109/tip.2026.3653212
Wang Liu, Zhuangzi Li, Ge Li, Siwei Ma, Sam Kwong, Wei Gao
The geometry-based point cloud compression algorithm achieves efficient compression and transmission for LiDAR point clouds with high sparsity. However, the low-bitrate mode results in severe geometry compression artifacts, which involve both point reduction and coordinate offset. To the best of our knowledge, this is the first attempt to directly enhance the geometry quality of compressed LiDAR point clouds (CLGE) in a post-processing manner. Our proposed method consists of two branches: cylindrical densification and adaptive refinement. The former adopts a multi-scale sparse convolution framework to effectively extract spatial features in the cylindrical coordinate system and generate dense candidate points quickly. Large asymmetric sparse convolution kernels are also designed to capture the shapes of different regions and objects. The latter branch refines the candidate points through several MLP layers, taking into account the neighborhood features between the candidate points and the input points. Finally, the designed ring-based farthest point resampling serves as an effective alternative for achieving the target number of points while maintaining the geometry distribution. Extensive experiments conducted on several datasets verify the effectiveness of our approach under different compression artifact levels. Furthermore, our method is easily extended to upsampling and is robust to noise. In addition to the geometry signal quality improvement, the point cloud enhanced by our proposed method alleviates the performance degradation in the object detection task due to compression distortion.
Title: Post-Processing Geometry Enhancement for G-PCC Compressed LiDAR via Cylindrical Densification. IEEE Transactions on Image Processing.
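The resampling step above builds on farthest point sampling. The paper's ring-based variant is not described in detail, so the sketch below shows only the standard greedy FPS it presumably extends: repeatedly pick the point farthest from everything selected so far, which tends to preserve the overall geometry distribution.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Standard FPS: greedily pick k points maximizing the minimum
    distance to the already-selected set. points: (N, 3)."""
    chosen = [0]                                   # arbitrary seed point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())                   # farthest from current set
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [1.0, 0, 0], [0.5, 0, 0]])
print(farthest_point_sampling(pts, 2))  # keeps the two extremes of the segment
```

A ring-aware variant would presumably run this per LiDAR scan ring so the characteristic ring structure of spinning-LiDAR clouds survives resampling.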
Pub Date: 2026-01-23, DOI: 10.1109/tip.2026.3654770
Can Peng, Manxin Chao, Ruoyu Li, Zaiqing Chen, Lijun Yun, Yuelong Xia
Using higher-resolution feature maps in the network is an effective approach for detecting small objects. However, high-resolution feature maps face the challenge of lacking semantic information. This has led previous methods to rely on downsampling feature maps, applying large-kernel convolution layers, and then upsampling the feature maps to obtain semantic information. However, these methods have certain limitations: first, large-kernel convolutions in deeper layers typically provide significant global semantic information, but our experiments reveal that such prominent semantic information introduces background smear, which in turn leads to overfitting. Second, deep features often contain substantial redundant information, and the features of small objects are either minimal or have disappeared, which causes a degradation in detection performance when directly relying on deep features. To address these issues, we propose a high-resolution network based on local contextual semantics (HR-SemNet). The network is built on the proposed high-resolution backbone (HRB), which replaces the traditional backbone-FPN architecture by focusing all computational resources of large-kernel convolutions on high-resolution feature layers to capture clearer features of small objects. Additionally, a local context semantic module (LCSM) is employed to extract semantic information from the background, confining the semantic extraction to a local window to avoid interference from large-scale backgrounds and objects. HR-SemNet decouples small object semantics from contextual semantics, with HRB and LCSM independently extracting these features.
On the VisDrone dataset, which contains a large number of small objects, HR-SemNet improves the mean average precision (mAP) by 4.6%, reduces the computational cost (GFLOPs) by 49.9%, and decreases the parameter count by 94.9%.
Title: HR-SemNet: A High-Resolution Network for Enhanced Small Object Detection With Local Contextual Semantics. IEEE Transactions on Image Processing.
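The key property of the LCSM, as described above, is that context is gathered only within a local window. As a rough, assumed illustration (not the module itself), the sketch below pools context inside non-overlapping windows and broadcasts it back, so no pixel receives information from a distant background region.

```python
import numpy as np

def local_window_context(feat, win):
    """Average-pool context inside non-overlapping win x win windows and
    broadcast it back, so each pixel only sees window-local context."""
    h, w = feat.shape
    assert h % win == 0 and w % win == 0
    blocks = feat.reshape(h // win, win, w // win, win)
    ctx = blocks.mean(axis=(1, 3), keepdims=True)  # one value per window
    return np.broadcast_to(ctx, blocks.shape).reshape(h, w)

feat = np.arange(16, dtype=float).reshape(4, 4)
print(local_window_context(feat, 2))  # each 2x2 window holds its own mean
```

Contrast this with global average pooling, where a large bright background patch would leak into every small-object location, the "background smear" effect the abstract describes.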
Pub Date: 2026-01-22, DOI: 10.1109/tip.2026.3654892
Xiang Yuan, Gong Cheng, Jiacheng Cheng, Ruixiang Yao, Junwei Han
Small object detection (SOD) constitutes a notable yet immensely arduous task, stemming from the restricted informative regions inherent in size-limited instances, which further gives rise to heightened uncertainty beyond the capacity of current two-stage detectors. Specifically, the intrinsic ambiguity of small objects undermines the prevailing sampling paradigms and may mislead the model into devoting futile effort to unrecognizable targets, while the inconsistency of the features used for detection at the two stages further exposes hierarchical uncertainty. In this paper, we develop an Uncertainty learning framework for Small Object Detection, dubbed Unc-SOD. By incorporating an auxiliary uncertainty branch into the conventional Region Proposal Network (RPN), we model the indeterminacy at the instance level, which later serves as a surrogate criterion for sampling, thereby unearthing adequate candidates dynamically based on their varying degrees of uncertainty and facilitating the learning of proposal networks. In parallel, a Perception-and-Interaction strategy is devised to capture rich and discriminative representations by optimizing the intrinsic properties of the regional features from the original pyramid and the assigned one, in which the perceptual process unfolds in a mutual paradigm. As the first attempt to model uncertainty in the SOD task, our Unc-SOD yields state-of-the-art performance on two large-scale small object detection benchmarks, SODA-D and SODA-A, and the results on several SOD-oriented datasets including COCO, VisDrone, and Tsinghua-Tencent 100K also show clear improvements over the baseline detector. This underscores the efficacy of our approach and its superiority over prevailing detectors when dealing with small instances.
Title: Unc-SOD: An Uncertainty Learning Framework for Small Object Detection. IEEE Transactions on Image Processing.
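The abstract describes instance-level uncertainty acting as a surrogate sampling criterion. A minimal sketch of that idea, under the assumption (ours, not the paper's) that the criterion simply penalizes objectness by predicted uncertainty, looks like this:

```python
import numpy as np

def uncertainty_guided_sample(scores, uncert, k, lam=1.0):
    """Rank candidate proposals by objectness minus a penalty on predicted
    uncertainty, so clearly unrecognizable candidates are sampled less.

    scores, uncert: (N,) arrays; lam trades off confidence vs. uncertainty.
    Returns the indices of the k selected proposals.
    """
    criterion = scores - lam * uncert
    return np.argsort(-criterion)[:k]

scores = np.array([0.9, 0.8, 0.85])   # high-scoring but very uncertain first box
uncert = np.array([0.7, 0.1, 0.2])
print(uncertainty_guided_sample(scores, uncert, 2))  # [1 2]
```

Note how the top-scoring proposal (index 0) is skipped because its uncertainty dominates; a score-only sampler would have kept it.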
Video object segmentation (VOS) is a fundamental task in video analysis, aiming to accurately recognize and segment objects of interest within video sequences. Conventional methods, relying on memory networks to store single-frame appearance features, face challenges in computational efficiency and in capturing dynamic visual information effectively. To address these limitations, we present a Video Decoupling Network (VDN) with a per-clip memory updating mechanism. Our approach is inspired by the dual-stream hypothesis of the human visual cortex and decomposes multiple previous video frames into fundamental elements: scene, motion, and instance. We propose the Unified Prior-based Spatio-temporal Decoupler (UPSD) algorithm, which parses multiple frames into basic elements in a unified manner. UPSD continuously stores elements over time, enabling adaptive integration of different cues based on task requirements. This decomposition mechanism facilitates comprehensive spatial-temporal information capture and rapid updating, leading to notable enhancements in overall VOS performance. Extensive experiments conducted on multiple VOS benchmarks validate the state-of-the-art accuracy, efficiency, generalizability, and robustness of our approach. Remarkably, VDN demonstrates a significant performance improvement and a substantial speed-up compared to previous state-of-the-art methods on multiple VOS benchmarks. It also exhibits excellent generalizability under domain shift and robustness against various noise types.
Title: Video Decoupling Networks for Accurate, Efficient, Generalizable, and Robust Video Object Segmentation. Authors: Jisheng Dang, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu, Tat-Seng Chua. IEEE Transactions on Image Processing, 2026-01-21, DOI: 10.1109/tip.2025.3649360.
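The efficiency claim rests on updating memory per clip rather than per frame. The toy sketch below (class name, summarization-by-mean, and FIFO eviction are all our assumptions, not VDN's design) shows why this shrinks memory traffic: T frames produce T/clip_len memory writes instead of T.

```python
from collections import deque

class PerClipMemory:
    """Minimal per-clip memory: stores one summary entry per clip (not per
    frame) and evicts the oldest clip once capacity is reached."""
    def __init__(self, capacity, clip_len):
        self.mem = deque(maxlen=capacity)   # bounded clip-level memory
        self.clip_len = clip_len
        self._buffer = []                   # frames of the in-progress clip

    def add_frame(self, feat):
        self._buffer.append(feat)
        if len(self._buffer) == self.clip_len:      # clip complete: summarize
            self.mem.append(sum(self._buffer) / self.clip_len)
            self._buffer = []

    def read(self):
        return list(self.mem)

m = PerClipMemory(capacity=2, clip_len=2)
for f in [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]:   # six frames -> three clips
    m.add_frame(f)
print(m.read())  # [6.0, 10.0] - the oldest clip summary was evicted
```

Six frames yield only three memory writes, and the bounded deque keeps lookup cost constant regardless of video length.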