
Latest Publications in IEEE Transactions on Image Processing

U-RWKV: Accurate and Efficient Volumetric Medical Image Segmentation via RWKV.
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-23. DOI: 10.1109/tip.2026.3654389
Hongyu Cai, Yifan Wang, Liu Wang, Jian Zhao, Zhejun Kuang
Accurate and efficient volumetric medical image segmentation is vital for clinical diagnosis, pre-operative planning, and disease-progression monitoring. Conventional convolutional neural networks (CNNs) struggle to capture long-range contextual information, whereas Transformer-based methods suffer from quadratic computational complexity, making it challenging to couple global modeling with high efficiency. To address these limitations, we explore an efficient yet accurate segmentation model for volumetric data. Specifically, we introduce a novel linear-complexity sequence modeling technique, RWKV, and leverage it to design a Tri-directional Spatial Enhancement RWKV (TSE-R) block; this module performs global modeling via RWKV and incorporates two optimizations tailored to three-dimensional data: (1) a spatial-shift strategy that enlarges the local receptive field and facilitates inter-block interaction, thereby alleviating the structural information loss caused by sequence serialization; and (2) a tri-directional scanning mechanism that constructs sequences along three distinct directions, applies global modeling via WKV, and fuses them with learnable weights to preserve the inherent 3D spatial structure. Building upon the TSE-R block, we develop an end-to-end 3D segmentation network, termed U-RWKV. Extensive experiments on three public 3D medical segmentation benchmarks demonstrate that U-RWKV outperforms state-of-the-art CNN-, Transformer-, and Mamba-based counterparts, achieving a Dice score of 87.21% on the Synapse multi-organ abdominal dataset while reducing parameter count by a factor of 16.08 compared with leading methods.
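To make the tri-directional scanning mechanism concrete, the sketch below serializes a 3D feature volume along three axis orders, mixes each sequence, and fuses the results with learnable weights. This is a hypothetical PyTorch illustration, not the authors' implementation: the depthwise 1D convolution merely stands in for the RWKV/WKV operator, and all module and parameter names are invented.

```python
# Hypothetical sketch of tri-directional scanning; the 1D-conv stand-in for
# WKV and all names are assumptions, not the authors' code.
import torch
import torch.nn as nn

class TriDirectionalScan(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # One lightweight sequence mixer per scan direction (WKV placeholder).
        self.mixers = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=3, padding=1, groups=channels)
             for _ in range(3)]
        )
        self.fuse_logits = nn.Parameter(torch.zeros(3))  # learnable fusion weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) feature volume
        b, c = x.shape[:2]
        orders = [(2, 3, 4), (3, 4, 2), (4, 2, 3)]  # D-H-W, H-W-D, W-D-H scans
        outs = []
        for mixer, order in zip(self.mixers, orders):
            perm = (0, 1) + order
            seq = x.permute(*perm).reshape(b, c, -1)             # serialize volume
            mixed = mixer(seq).reshape(b, c, *(x.shape[i] for i in order))
            inv = [perm.index(i) for i in range(5)]              # undo permutation
            outs.append(mixed.permute(*inv))
        w = torch.softmax(self.fuse_logits, dim=0)               # fuse directions
        return w[0] * outs[0] + w[1] * outs[1] + w[2] * outs[2]

vol = torch.randn(1, 8, 16, 16, 16)
print(TriDirectionalScan(8)(vol).shape)  # torch.Size([1, 8, 16, 16, 16])
```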
Citations: 0
Knowledge-Prompted Trustworthy Disentangled Learning for Thyroid Ultrasound Segmentation with Limited Annotations.
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-23. DOI: 10.1109/tip.2026.3654413
Wenxu Wang, Weizhen Wang, Qianjin Feng, Yu Zhang, Zhenyuan Ning
The similar textures, diverse shapes, and blurred boundaries of thyroid lesions in ultrasound images pose a significant challenge to accurate segmentation. Although several methods have been proposed to alleviate the aforementioned issues, their generalization is hindered by limited annotation data and an insufficient ability to distinguish lesions from their surrounding tissues, especially in the presence of noise and outliers. Additionally, most existing methods lack uncertainty estimation, which is essential for providing trustworthy results and identifying potential mispredictions. To this end, we propose knowledge-prompted trustworthy disentangled learning (KPTD) for thyroid ultrasound segmentation with limited annotations. The proposed method consists of three key components: 1) Knowledge-aware prompt learning (KAPL) encodes TI-RADS reports into text features and introduces learnable prompts to extract contextual embeddings, which assist in generating region activation maps (serving as pseudo-labels for unlabeled images). 2) Foreground-background disentangled learning (FBDL) leverages region activation maps to disentangle foreground and background representations, refining their prototype distributions through a contrastive learning strategy to enhance the model's discrimination and robustness. 3) Foreground-background trustworthy fusion (FBTF) integrates the foreground and background representations and estimates their uncertainty based on evidence theory, providing trustworthy segmentation results. Experimental results show that KPTD achieves superior segmentation performance under limited annotations, significantly outperforming state-of-the-art methods.
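The trustworthy-fusion idea rests on evidence-theory (subjective logic) uncertainty, where non-negative evidence parameterizes a Dirichlet distribution and the uncertainty mass shrinks as total evidence grows. Below is a minimal sketch of that standard formulation for binary segmentation; it is an assumption-laden illustration, not the paper's FBTF module.

```python
# Standard subjective-logic/Dirichlet uncertainty for binary segmentation;
# a minimal sketch of the evidence-theory idea, not the paper's FBTF module.
import torch
import torch.nn.functional as F

def evidential_segmentation(logits: torch.Tensor):
    """logits: (B, 2, H, W) raw scores for background/foreground."""
    evidence = F.softplus(logits)               # non-negative evidence
    alpha = evidence + 1.0                      # Dirichlet parameters
    strength = alpha.sum(dim=1, keepdim=True)   # Dirichlet strength S
    prob = alpha / strength                     # expected class probabilities
    uncertainty = logits.shape[1] / strength    # u = K / S: shrinks with evidence
    return prob, uncertainty

prob, unc = evidential_segmentation(torch.randn(1, 2, 64, 64))
print(prob.shape, unc.mean().item())            # probabilities and mean uncertainty
```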
Citations: 0
Topology-Guided Semantic Face Center Estimation for Rotation-Invariant Face Detection.
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-23. DOI: 10.1109/tip.2026.3654422
Hathai Kaewkorn, Lifang Zhou, Weisheng Li, Chengjiang Long
Face detection accuracy decreases significantly under rotational variations, including in-plane (RIP) and out-of-plane (ROP) rotations. ROP is particularly problematic because it distorts facial landmarks, which leads to inaccurate face center localization. Meanwhile, many existing rotation-invariant models are primarily designed to handle RIP; they often fail under ROP because they lack the ability to capture semantic and topological relationships. Moreover, existing datasets frequently suffer from unreliable landmark annotations caused by imperfect ground-truth labeling, the absence of precise center annotations, and imbalanced data across different rotation angles. To address these challenges, we propose a topology-guided semantic face center estimation method that leverages graph-based landmark relationships to preserve structural integrity under both RIP and ROP. Additionally, we construct a rotation-aware face dataset with accurate face center annotations and balanced rotational diversity to support training under extreme pose conditions. Next, we introduce a Hybrid-ViT model that fuses CNN spatial features with transformer-based global context and employs a center-guided module for robust landmark localization under extreme rotations. To evaluate center quality, we further design a hybrid metric that combines topological geometry with semantic perception for a more comprehensive evaluation of face center accuracy. Finally, experimental results demonstrate that our method outperforms state-of-the-art models in cross-dataset evaluations. Code: https://github.com/Catster111/TCE_RIFD.
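As a toy illustration of topology-guided center estimation, the sketch below averages landmark coordinates over a fixed adjacency graph and regresses per-landmark votes for the face center. The graph layer, MLP sizes, and voting scheme are all hypothetical stand-ins for the paper's method.

```python
# Toy topology-guided center estimator; adjacency, layer sizes, and the
# voting scheme are invented for illustration only.
import torch
import torch.nn as nn

class LandmarkGraphCenter(nn.Module):
    def __init__(self, adjacency: torch.Tensor):
        super().__init__()
        deg = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
        self.register_buffer("norm_adj", adjacency / deg)        # row-normalized graph
        self.mlp = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (B, N, 2) image coordinates
        context = self.norm_adj @ landmarks       # one round of neighbor averaging
        offsets = self.mlp(context)               # per-landmark center votes
        return (landmarks + offsets).mean(dim=1)  # (B, 2) estimated face center

adj = (torch.rand(5, 5) > 0.5).float()            # placeholder 5-landmark topology
print(LandmarkGraphCenter(adj)(torch.rand(2, 5, 2)).shape)  # torch.Size([2, 2])
```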
Citations: 0
RAM-VQA: Restoration Assisted Multi-modality Video Quality Assessment.
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-23. DOI: 10.1109/tip.2026.3655117
Pengfei Chen, Jiebin Yan, Rajiv Soundararajan, Giuseppe Valenzise, Cai Li, Leida Li
Video Quality Assessment (VQA) strives to computationally emulate human perceptual judgments and has garnered significant attention given its widespread applicability. However, existing methodologies face two primary impediments: (1) limited proficiency in evaluating samples at quality extremes (e.g., severely degraded or near-perfect videos), and (2) insufficient sensitivity to nuanced quality variations arising from a misalignment with human perceptual mechanisms. Although vision-language models offer promising semantic understanding, their reliance on visual encoders pre-trained for high-level tasks often compromises their sensitivity to low-level distortions. To surmount these challenges, we propose the Restoration-Assisted Multi-modality VQA (RAM-VQA) framework. Uniquely, our approach leverages video restoration as a proxy to explicitly model distortion-sensitive features. The framework operates through two synergistic stages: a prompt learning stage that constructs a quality-aware textual space using triple-level references (degraded, restored, and pristine) derived from the restoration process, and a dual-branch evaluation stage that integrates semantic cues with technical quality indicators via spatio-temporal differential analysis. Extensive experiments demonstrate that RAM-VQA achieves state-of-the-art performance across diverse benchmarks, exhibiting superior capability in handling extreme-quality content while ensuring robust generalization.
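One way to picture "restoration as a proxy for distortion-sensitive features" is to treat the residual between a clip and its restored version as a quality cue: the more the restorer changes, the stronger the degradation is likely to be. The sketch below computes such a cue together with a simple temporal-difference statistic; it is a hypothetical proxy with invented names, not the RAM-VQA pipeline.

```python
# Hypothetical distortion cue from restoration residuals; a crude proxy for
# the idea described above, not the RAM-VQA model.
import torch

def restoration_residual_cues(degraded: torch.Tensor, restored: torch.Tensor):
    """degraded, restored: (T, C, H, W) clips with values in [0, 1]."""
    # How much the restorer changed each frame: larger => stronger degradation.
    spatial = (restored - degraded).abs().mean(dim=(1, 2, 3))
    # Frame-to-frame differences capture temporal quality fluctuations.
    temporal = (degraded[1:] - degraded[:-1]).abs().mean(dim=(1, 2, 3))
    return spatial.mean().item(), temporal.mean().item()

clip = torch.rand(8, 3, 64, 64)
print(restoration_residual_cues(clip, clip.clamp(0.1, 0.9)))
```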
Citations: 0
Post-Processing Geometry Enhancement for G-PCC Compressed LiDAR via Cylindrical Densification.
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-23. DOI: 10.1109/tip.2026.3653212
Wang Liu, Zhuangzi Li, Ge Li, Siwei Ma, Sam Kwong, Wei Gao
The geometry-based point cloud compression algorithm achieves efficient compression and transmission for LiDAR point clouds with high sparsity. However, the low-bitrate mode results in severe geometry compression artifacts, which involve both point reduction and coordinate offset. To the best of our knowledge, this is the first attempt to directly enhance the geometry quality of compressed LiDAR point clouds (CLGE) in a post-processing manner. Our proposed method consists of two branches: cylindrical densification and adaptive refinement. The former adopts a multi-scale sparse convolution framework to effectively extract spatial features in the cylindrical coordinate system and quickly generate dense candidate points. Large asymmetric sparse convolution kernels are also designed to capture the shapes of different regions and objects. The latter branch refines the candidate points through several MLP layers, taking the neighborhood features between the candidate points and the input points into account. Finally, the designed ring-based farthest point resampling serves as an effective alternative for reaching the target point count while maintaining the geometry distribution. Extensive experiments conducted on several datasets verify the effectiveness of our approach under different compression artifact levels. Furthermore, our method is easily extended to upsampling and is robust to noise. In addition to improving geometry signal quality, the point cloud enhanced by our proposed method alleviates the performance degradation in object detection caused by compression distortion.
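Two ingredients named above are easy to illustrate directly: the cylindrical coordinate transform and farthest point sampling (shown here in its plain form, not the paper's ring-based variant). The sketch below shows both under invented function names; the sparse-convolution branches are not reproduced.

```python
# Cylindrical transform plus plain greedy farthest point sampling; function
# names are assumptions, and the ring-based resampling is not reproduced.
import torch

def to_cylindrical(xyz: torch.Tensor) -> torch.Tensor:
    """Map (N, 3) Cartesian LiDAR points to (rho, phi, z)."""
    rho = torch.hypot(xyz[:, 0], xyz[:, 1])
    phi = torch.atan2(xyz[:, 1], xyz[:, 0])
    return torch.stack([rho, phi, xyz[:, 2]], dim=1)

def farthest_point_sampling(points: torch.Tensor, k: int) -> torch.Tensor:
    """Greedy O(N*k) FPS; returns indices of k well-spread points."""
    n = points.shape[0]
    idx = torch.zeros(k, dtype=torch.long)       # start from point 0
    dist = torch.full((n,), float("inf"))
    for i in range(1, k):
        last = points[idx[i - 1]]
        dist = torch.minimum(dist, (points - last).pow(2).sum(dim=1))
        idx[i] = torch.argmax(dist)              # farthest from all chosen so far
    return idx

pts = torch.randn(1024, 3)
print(to_cylindrical(pts)[farthest_point_sampling(pts, 256)].shape)  # (256, 3)
```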
Citations: 0
HR-SemNet: A High-Resolution Network for Enhanced Small Object Detection With Local Contextual Semantics.
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-23. DOI: 10.1109/tip.2026.3654770
Can Peng, Manxin Chao, Ruoyu Li, Zaiqing Chen, Lijun Yun, Yuelong Xia
Using higher-resolution feature maps in the network is an effective approach for detecting small objects. However, high-resolution feature maps face the challenge of lacking semantic information. This has led previous methods to rely on downsampling feature maps, applying large-kernel convolution layers, and then upsampling the feature maps to obtain semantic information. However, these methods have certain limitations: first, large-kernel convolutions in deeper layers typically provide significant global semantic information, but our experiments reveal that such prominent semantic information introduces background smear, which in turn leads to overfitting. Second, deep features often contain substantial redundant information, and the features of small objects are either minimal or have disappeared, which causes a degradation in detection performance when directly relying on deep features. To address these issues, we propose a high-resolution network based on local contextual semantics (HR-SemNet). The network is built on the proposed high-resolution backbone (HRB), which replaces the traditional backbone-FPN architecture by focusing all computational resources of large-kernel convolutions on high-resolution feature layers to capture clearer features of small objects. Additionally, a local context semantic module (LCSM) is employed to extract semantic information from the background, confining the semantic extraction to a local window to avoid interference from large-scale backgrounds and objects. HR-SemNet decouples small object semantics from contextual semantics, with HRB and LCSM independently extracting these features. Extensive experiments and comprehensive evaluations on the VisDrone, AI-TOD, and TinyPerson datasets validate the effectiveness of the method. On the VisDrone dataset, which contains a large number of small objects, HR-SemNet improves the mean average precision (mAP) by 4.6%, reduces the computational cost (GFLOPs) by 49.9%, and decreases the parameter count by 94.9%.
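A minimal sketch of the local-context idea: apply a large depthwise convolution on the high-resolution map so that semantic aggregation is confined to a bounded window rather than the whole image. This is one hypothetical reading of LCSM, with invented names and an arbitrarily chosen window size.

```python
# Hypothetical reading of the local context semantic module (LCSM): a large
# depthwise convolution bounds semantic aggregation to a local window.
import torch
import torch.nn as nn

class LocalContextSemantics(nn.Module):
    def __init__(self, channels: int, window: int = 11):
        super().__init__()
        # Depthwise large-kernel conv: context never exceeds `window` pixels.
        self.local_ctx = nn.Conv2d(channels, channels, kernel_size=window,
                                   padding=window // 2, groups=channels)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the high-resolution detail intact.
        return x + self.proj(torch.relu(self.local_ctx(x)))

feat = torch.randn(1, 32, 128, 128)
print(LocalContextSemantics(32)(feat).shape)  # torch.Size([1, 32, 128, 128])
```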
Citations: 0
Unc-SOD: An Uncertainty Learning Framework for Small Object Detection.
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-22. DOI: 10.1109/tip.2026.3654892
Xiang Yuan, Gong Cheng, Jiacheng Cheng, Ruixiang Yao, Junwei Han
Small object detection (SOD) constitutes a notable yet immensely arduous task, stemming from the restricted informative regions inherent in size-limited instances, which further gives rise to heightened uncertainty beyond the capacity of current two-stage detectors. Specifically, the intrinsic ambiguity in small objects undermines the prevailing sampling paradigms and may mislead the model into devoting futile effort to unrecognizable targets, while the inconsistency of features utilized for detection at the two stages further exposes the hierarchical uncertainty. In this paper, we develop an Uncertainty learning framework for Small Object Detection, dubbed Unc-SOD. By incorporating an auxiliary uncertainty branch into the conventional Region Proposal Network (RPN), we model the indeterminacy at instance level, which later serves as a surrogate criterion for sampling, thereby unearthing adequate candidates dynamically based on the varying degrees of uncertainty and facilitating the learning of proposal networks. In parallel, a Perception-and-Interaction strategy is devised to capture rich and discriminative representations by optimizing the intrinsic properties of the regional features at the original pyramid and the assigned one, in which the perceptual process unfolds in a mutual paradigm. As a seminal attempt to model uncertainty in the SOD task, our Unc-SOD yields state-of-the-art performance on two large-scale small object detection benchmarks, SODA-D and SODA-A, and results on several SOD-oriented datasets including COCO, VisDrone, and Tsinghua-Tencent 100K also show consistent gains over the baseline detector. This underscores the efficacy of our approach and its superiority over prevailing detectors when dealing with small instances.
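The sampling idea can be sketched as follows: alongside each proposal's objectness score, an auxiliary branch predicts a log-variance, and proposals are drawn with probability shaped by both. The weighting rule below (score times predicted certainty) is an assumed form for illustration, not the paper's exact criterion.

```python
# Assumed form of uncertainty-guided proposal sampling; the weighting rule
# is illustrative, not the paper's exact criterion.
import torch

def uncertainty_guided_sampling(scores: torch.Tensor, log_var: torch.Tensor,
                                num_samples: int) -> torch.Tensor:
    """scores, log_var: (N,) per-proposal objectness and predicted log-variance.
    Returns indices of sampled proposals."""
    certainty = torch.exp(-log_var)           # low variance => high certainty
    weight = scores.sigmoid() * certainty     # demote unrecognizable targets
    weight = weight.clamp(min=1e-6)           # keep every proposal reachable
    return torch.multinomial(weight, num_samples, replacement=False)

idx = uncertainty_guided_sampling(torch.randn(1000), torch.randn(1000), 256)
print(idx.shape)  # torch.Size([256])
```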
Citations: 0
Video Decoupling Networks for Accurate, Efficient, Generalizable, and Robust Video Object Segmentation.
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-21. DOI: 10.1109/tip.2025.3649360
Jisheng Dang, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu, Tat-Seng Chua
Video object segmentation (VOS) is a fundamental task in video analysis, aiming to accurately recognize and segment objects of interest within video sequences. Conventional methods, relying on memory networks to store single-frame appearance features, face challenges in computational efficiency and in capturing dynamic visual information effectively. To address these limitations, we present a Video Decoupling Network (VDN) with a per-clip memory updating mechanism. Our approach is inspired by the dual-stream hypothesis of the human visual cortex and decomposes multiple previous video frames into fundamental elements: scene, motion, and instance. We propose the Unified Prior-based Spatio-temporal Decoupler (UPSD) algorithm, which parses multiple frames into basic elements in a unified manner. UPSD continuously stores elements over time, enabling adaptive integration of different cues based on task requirements. This decomposition mechanism facilitates comprehensive spatio-temporal information capture and rapid updating, leading to notable enhancements in overall VOS performance. Extensive experiments conducted on multiple VOS benchmarks validate the state-of-the-art accuracy, efficiency, generalizability, and robustness of our approach. Remarkably, VDN demonstrates a significant performance improvement and a substantial speed-up compared to previous state-of-the-art methods on multiple VOS benchmarks. It also exhibits excellent generalizability under domain shift and robustness against various noise types.
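A toy version of per-clip memory over decoupled elements: keep one slot each for scene, motion, and instance features and update them with an exponential moving average as clips arrive. This is a deliberately simplified stand-in for the UPSD algorithm, with an invented momentum-based update.

```python
# Deliberately simplified per-clip memory over decoupled elements; the EMA
# update and element names are stand-ins for the paper's UPSD algorithm.
import torch

class ClipMemory:
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.store = {"scene": None, "motion": None, "instance": None}

    def update(self, element: str, feat: torch.Tensor) -> None:
        old = self.store[element]
        self.store[element] = feat if old is None else (
            self.momentum * old + (1 - self.momentum) * feat)

    def read(self, element: str) -> torch.Tensor:
        return self.store[element]

mem = ClipMemory()
for _ in range(4):                     # one update per incoming clip
    mem.update("scene", torch.randn(256))
print(mem.read("scene").shape)         # torch.Size([256])
```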
Citations: 0
IEEE Transactions on Image Processing publication information
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-15. DOI: 10.1109/tip.2026.3651208
{"title":"IEEE Transactions on Image Processing publication information","authors":"","doi":"10.1109/tip.2026.3651208","DOIUrl":"https://doi.org/10.1109/tip.2026.3651208","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"58 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios
IF 10.6, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-15. DOI: 10.1109/tip.2026.3651951
Guangqian Guo, Pengfei Chen, Yong Guo, Huafeng Chen, Boqiang Zhang, Shan Gao
{"title":"Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios","authors":"Guangqian Guo, Pengfei Chen, Yong Guo, Huafeng Chen, Boqiang Zhang, Shan Gao","doi":"10.1109/tip.2026.3651951","DOIUrl":"https://doi.org/10.1109/tip.2026.3651951","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"26 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0