
Latest articles in IEEE Transactions on Pattern Analysis and Machine Intelligence

SS-NeRF: Physically Based Sparse Spectral Rendering With Neural Radiance Field
IF 18.6 | Pub Date: 2025-09-17 | DOI: 10.1109/TPAMI.2025.3611376
Ru Li;Jia Liu;Guanghui Liu;Shengping Zhang;Bing Zeng;Shuaicheng Liu
In this paper, we propose SS-NeRF, an end-to-end Neural Radiance Field (NeRF)-based architecture for high-quality physically based rendering with sparse inputs. We decompose classical spectral rendering into two main steps: 1) the generation of a series of spectrum maps spanning different wavelengths, and 2) the combination of these spectrum maps into the RGB output. The proposed architecture follows these two steps through a multi-layer perceptron (MLP)-based module (SpectralMLP) and a spectrum attention UNet (SAUNet). Given the ray origin and the ray direction, the SpectralMLP constructs the spectral radiance field to obtain spectrum maps of novel views, which are then sent to the SAUNet to produce RGB images under white-light illumination. Applying NeRF to build up spectral rendering is a more physically based approach from the perspective of ray tracing. Further, the spectral radiance fields decompose difficult scenes and improve the performance of NeRF-based methods. Previous baselines, such as SpectralNeRF, outperform recent methods in synthesizing novel views but require relatively dense viewpoints for accurate scene reconstruction. To tackle this, we propose SS-NeRF to enhance the detail of scene representation with sparse inputs. In SS-NeRF, we first design a depth-aware continuity to optimize the reconstruction based on single-view depth predictions. Then, a geometric-projected consistency is introduced to optimize multi-view geometry alignment. Additionally, we introduce a superpixel-aligned consistency to ensure that the average color within each superpixel region remains consistent. Comprehensive experimental results demonstrate that the proposed method is superior to recent state-of-the-art methods when synthesizing novel views on both synthetic and real-world datasets.
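For intuition on step 2 above, the sketch below combines per-wavelength spectrum maps into an RGB image via wavelength-dependent weights; the random weights stand in for color-matching-function samples and for the learned SAUNet, so this is a minimal illustration rather than the authors' implementation.

```python
import numpy as np

def combine_spectrum_maps(spectrum_maps, rgb_weights):
    """Weighted combination of per-wavelength radiance maps into one RGB image.

    spectrum_maps: (N, H, W) array, one map per sampled wavelength.
    rgb_weights:   (N, 3) array of per-wavelength contributions to R, G, B
                   (placeholder values standing in for color-matching functions).
    """
    rgb = np.tensordot(rgb_weights, spectrum_maps, axes=(0, 0))  # (3, H, W)
    rgb = np.moveaxis(rgb, 0, -1)                                # (H, W, 3)
    return np.clip(rgb / rgb.max(), 0.0, 1.0)                    # normalize to [0, 1]

# Toy example: 8 wavelength bands on a 4x4 image.
rng = np.random.default_rng(0)
maps = rng.random((8, 4, 4))
weights = rng.random((8, 3))
print(combine_spectrum_maps(maps, weights).shape)  # (4, 4, 3)
```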
Citations: 0
LVOS: A Benchmark for Large-Scale Long-Term Video Object Segmentation
IF 18.6 | Pub Date: 2025-09-17 | DOI: 10.1109/TPAMI.2025.3611020
Lingyi Hong;Zhongying Liu;Wenchao Chen;Chenzhi Tan;Yuang Feng;Xinyu Zhou;Pinxue Guo;Jinglun Li;Zhaoyu Chen;Shuyong Gao;Wei Zhang;Wenqiang Zhang
Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, most existing VOS benchmarks focus mainly on short-term videos, where objects remain visible most of the time. However, these benchmarks may not fully capture the challenges encountered in practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. Thus, we propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average. Each video includes various attributes, especially challenges encountered in the wild, such as long-term reappearing and cross-temporal similar objects. Compared to previous benchmarks, our LVOS better reflects VOS models’ performance in real scenarios. Based on LVOS, we evaluate 15 existing VOS models under 3 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that one of the significant factors contributing to the accuracy decline is the increased video length, interacting with complex challenges such as long-term reappearance, cross-temporal confusion, and occlusion, which underscores LVOS’s crucial role. We hope our LVOS can advance the development of VOS in real scenes.
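As a quick sanity check on the reported statistics (the frame rate is implied, not stated in the abstract), the numbers work out to roughly 412 frames per video at about 6 fps:

```python
videos, frames = 720, 296_401
frames_per_video = frames / videos               # ≈ 411.7 frames per video
avg_minutes = 1.14                               # average video duration
implied_fps = frames_per_video / (avg_minutes * 60)
print(f"{frames_per_video:.1f} frames/video, implied ~{implied_fps:.1f} fps")
```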
Citations: 0
Defenses in Adversarial Machine Learning: A Systematic Survey From the Lifecycle Perspective
IF 18.6 | Pub Date: 2025-09-17 | DOI: 10.1109/TPAMI.2025.3611340
Baoyuan Wu;Mingli Zhu;Meixi Zheng;Zihao Zhu;Shaokui Wei;Mingda Zhang;Hongrui Chen;Danni Yuan;Li Liu;Qingshan Liu
Adversarial phenomena have been widely observed in machine learning (ML) systems, especially those using deep neural networks. These phenomena describe situations in which ML systems may produce predictions that are inconsistent and incomprehensible to humans. Such behavior poses a serious security threat to the practical application of ML systems. To exploit this vulnerability, several advanced attack paradigms have been developed, mainly including backdoor attacks, weight attacks, and adversarial examples. For each individual attack paradigm, various defense mechanisms have been proposed to enhance the robustness of models against the corresponding attacks. However, due to the independence and diversity of these defense paradigms, it is challenging to assess the overall robustness of an ML system against different attack paradigms. This survey aims to provide a systematic review of all existing defense paradigms from a unified lifecycle perspective. Specifically, we decompose a complete ML system into five stages: pre-training, training, post-training, deployment, and inference. We then present a clear taxonomy to categorize representative defense methods at each stage. The unified perspective and taxonomy not only help us analyze defense mechanisms but also enable us to understand the connections and differences among different defense paradigms, inspiring future research toward more advanced and comprehensive defense strategies.
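To make the lifecycle framing concrete, here is a toy lookup table mapping each of the five stages to example defense families; the entries are generic illustrations, not the survey's exact taxonomy.

```python
# Illustrative only: lifecycle stage -> example defense families
# (generic examples, not the survey's exact categorization).
LIFECYCLE_DEFENSES = {
    "pre-training":  ["training-data filtering", "data sanitization"],
    "training":      ["adversarial training", "robust loss design"],
    "post-training": ["backdoor removal / fine-pruning", "model repair"],
    "deployment":    ["weight integrity checking", "model encryption"],
    "inference":     ["input purification", "adversarial/backdoor sample detection"],
}

def defenses_for(stage: str):
    """Return the example defense families recorded for a lifecycle stage."""
    return LIFECYCLE_DEFENSES.get(stage.lower(), [])

print(defenses_for("training"))
```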
Citations: 0
ACLI: A CNN Pruning Framework Leveraging Adjacent Convolutional Layer Interdependence and $\gamma$-Weakly Submodularity
IF 18.6 | Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610113
Sadegh Tofigh;Mohammad Askarizadeh;M. Omair Ahmad;M.N.S. Swamy;Kim Khoa Nguyen
Today, convolutional neural network (CNN) pruning techniques often rely on manually crafted importance criteria and pruning structures. Due to their heuristic nature, these methods may lack generality, and their performance is not guaranteed. In this paper, we propose a theoretical framework to address this challenge by leveraging the concept of $\gamma$-weak submodularity, based on a new efficient importance function. By deriving an upper bound on the absolute error in the layer subsequent to the pruned layer, we formulate the importance function as a $\gamma$-weakly submodular function. This formulation enables the development of an easy-to-implement, low-complexity, and data-free oblivious algorithm for selecting filters to be removed from a convolutional layer. Extensive experiments show that our method outperforms state-of-the-art benchmark networks across various datasets, with a computational cost comparable to the simplest pruning techniques, such as $l_{2}$-norm pruning. Notably, the proposed method achieves an accuracy of 76.52%, compared to 75.15% for the overall best baseline, with a 25.5% reduction in network parameters. According to our proposed resource-efficiency metric for pruning methods, the ACLI approach demonstrates orders-of-magnitude higher efficiency than the other baselines, while maintaining competitive accuracy.
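For context, the $l_{2}$-norm pruning baseline mentioned above can be sketched in a few lines: score each filter by the norm of its weights and keep the highest-scoring ones (ACLI's importance function and its $\gamma$-weakly submodular selection are more involved and not reproduced here; the keep ratio below is an arbitrary example).

```python
import numpy as np

def l2_prune_keep_indices(conv_weight, keep_ratio=0.75):
    """Indices of convolution filters to keep, ranked by the l2 norm of their weights.

    conv_weight: (out_channels, in_channels, kH, kW) array of filter weights.
    keep_ratio:  fraction of filters to keep (arbitrary example value).
    """
    scores = np.linalg.norm(conv_weight.reshape(conv_weight.shape[0], -1), axis=1)
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    keep = np.argsort(scores)[::-1][:n_keep]      # highest-norm filters survive
    return np.sort(keep)

w = np.random.default_rng(0).normal(size=(64, 32, 3, 3))
print(len(l2_prune_keep_indices(w)))              # 48 of 64 filters kept
```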
Citations: 0
3D Hand Pose Estimation via Articulated Anchor-to-Joint 3D Local Regressors
IF 18.6 | Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3609907
Changlong Jiang;Yang Xiao;Jinghong Zheng;Haohong Kuang;Cunlin Wu;Mingyang Zhang;Zhiguo Cao;Min Du;Joey Tianyi Zhou;Junsong Yuan
In this paper, we propose to address monocular 3D hand pose estimation from a single RGB or depth image via articulated anchor-to-joint 3D local regressors, in the form of A2J-Transformer+. The key idea is to make the local regressors (i.e., anchor points) in 3D space aware of the hand’s local fine details and global articulated context jointly, facilitating the prediction of their 3D offsets toward hand joints, which are aggregated with linear weighting for joint localization. Our intuition is that local fine details help to estimate accurate offsets but may suffer from issues including serious occlusion, confusingly similar patterns, and overfitting risk. On the other hand, the hand’s global articulated context can provide additional descriptive clues and constraints to alleviate these issues. To set anchor points adaptively in 3D space, A2J-Transformer+ runs in a 2-stage manner. At the first stage, owing to the properties of the input modality, anchor points distribute more densely on the X-Y plane, which leads to lower prediction accuracy along the Z direction than along the X and Y directions. To alleviate this, at the second stage anchor points are set near the joints yielded by the first stage, evenly along the X, Y, and Z directions. This treatment brings two main advantages: (1) balancing the prediction accuracy along the X, Y, and Z directions, and (2) ensuring the anchor-joint offsets have small values that are relatively easy to estimate. Extensive experiments on three RGB hand datasets (InterHand2.6M, HO-3D V2, and RHP) and three depth hand datasets (NYU, ICVL, and HANDS 2017) verify A2J-Transformer+’s superiority and generalization ability across different modalities (i.e., RGB and depth) and hand cases (i.e., single hand, interacting hands, and hand-object interaction), even outperforming model-based approaches. The test on the ITOP dataset reveals that A2J-Transformer+ can also be applied to the 3D human pose estimation task.
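The core anchor-to-joint aggregation can be sketched as a softmax-weighted average of per-anchor predictions; the shapes and weighting below are simplified placeholders for the actual A2J-Transformer+ heads.

```python
import numpy as np

def aggregate_joint(anchors, offsets, logits):
    """Estimate one joint position as a weighted sum of anchor-based predictions.

    anchors: (A, 3) anchor positions in 3D.
    offsets: (A, 3) predicted anchor-to-joint offsets.
    logits:  (A,)   per-anchor informativeness scores.
    """
    w = np.exp(logits - logits.max())
    w /= w.sum()                                      # softmax aggregation weights
    return (w[:, None] * (anchors + offsets)).sum(axis=0)

rng = np.random.default_rng(0)
A = 16
joint = aggregate_joint(rng.random((A, 3)),
                        0.1 * rng.standard_normal((A, 3)),
                        rng.standard_normal(A))
print(joint)  # a single (x, y, z) estimate
```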
Citations: 0
Efficient Nearest Neighbor Search Using Dynamic Programming
IF 18.6 | Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610211
Pengfei Wang;Jiantao Song;Shiqing Xin;Shuangmin Chen;Changhe Tu;Wenping Wang;Jiaye Wang
Given a collection of points in $\mathbb{R}^{3}$, KD-Tree and R-Tree are well-known nearest neighbor search (NNS) algorithms that rely on spatial partitioning and indexing techniques. However, when the query point is far from the data points or the data points inherently represent a 2-manifold surface, their query performance may degrade. To address this, we propose a novel dynamic programming technique that precomputes a Directed Acyclic Graph (DAG) to encode the proximity structure between data points. More specifically, the DAG captures how the proximity structure evolves during the incremental construction of the Voronoi diagram of the data points. Experimental results demonstrate that our method achieves a speed increase of 1-10x. Furthermore, our algorithm demonstrates significant practical value in diverse applications. We validated its effectiveness through extensive testing in four key applications: Point-to-Mesh Distance Queries, Iterative Closest Point (ICP) Registration, Density Peak Clustering, and Point-to-Segments Distance Queries. A particularly notable feature of our approach is its unique ability to efficiently identify the nearest neighbor among the first $k$ points in the point cloud, a capability that enables substantial acceleration in low-dimensional applications like Density Peak Clustering. As a natural extension of our incremental construction process, our method can also be readily adapted for farthest-point sampling tasks. These experimental results across multiple domains underscore the broad applicability and practical importance of our approach.
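The DAG built from the incremental Voronoi construction is not reproduced here; as a rough illustration of answering queries with a precomputed proximity structure, the sketch below performs greedy descent on an assumed k-nearest-neighbor graph and compares the result against brute force.

```python
import numpy as np

def greedy_graph_nn(points, neighbors, query, start=0):
    """Greedy nearest-neighbor descent on a precomputed neighbor graph.

    points:    (N, 3) data points.
    neighbors: list of neighbor-index lists, one per point (assumed precomputed).
    query:     (3,) query point.
    Returns the index of a locally nearest point; exact when the graph is well connected.
    """
    cur = start
    cur_d = np.linalg.norm(points[cur] - query)
    improved = True
    while improved:
        improved = False
        for nb in neighbors[cur]:
            d = np.linalg.norm(points[nb] - query)
            if d < cur_d:
                cur, cur_d, improved = nb, d, True
    return cur

rng = np.random.default_rng(0)
pts = rng.random((200, 3))
# Assumed precomputation: an 8-NN graph built by brute force for this toy example.
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
nbrs = [list(np.argsort(row)[1:9]) for row in dists]
q = rng.random(3)
print(greedy_graph_nn(pts, nbrs, q), np.argmin(np.linalg.norm(pts - q, axis=1)))
```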
Citations: 0
Generative Causality-Driven Network for Graph Multi-Task Learning
IF 18.6 | Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610096
Xixun Lin;Qing Yu;Yanan Cao;Lixin Zou;Chuan Zhou;Jia Wu;Chenliang Li;Peng Zhang;Shirui Pan
Multi-task learning (MTL) is a standard learning paradigm in machine learning. The central idea of MTL is to capture the shared knowledge among multiple tasks to mitigate the problem of data sparsity, where the annotated samples for each task are quite limited. Recent studies indicate that graph multi-task learning (GMTL) yields promising improvements over previous MTL methods. GMTL represents tasks on a task relation graph and further leverages graph neural networks (GNNs) to learn complex task relationships. Although GMTL achieves better performance, the construction of the task relation graph heavily depends on simple heuristic tricks, which results in spurious task correlations and the absence of true edges between strongly connected tasks. This problem largely limits the effectiveness of GMTL. To this end, we propose the Generative Causality-driven Network (GCNet), a novel framework that progressively learns the causal structure between tasks to discover which tasks benefit from being jointly trained, improving generalization ability and model robustness. To be specific, in the feature space, GCNet first introduces a feature-level generator to generate the structure prior, reducing learning difficulty. Afterwards, GCNet develops an output-level generator, parameterized as a new causal energy-based model (EBM), to refine the learned structure prior in the output space driven by causality. Benefiting from our proposed causal framework, we theoretically derive an intervention contrastive estimation for training this causal EBM efficiently. Experiments are conducted on multiple synthetic and real-world datasets. Extensive empirical results and model analyses demonstrate the superior performance of GCNet over several competitive MTL baselines.
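GCNet's generators and causal EBM are not reproduced here; as a toy illustration of the end goal (deciding which tasks to train jointly), the sketch below groups tasks from an assumed task-affinity matrix with made-up values.

```python
import numpy as np

def joint_training_groups(relation, threshold=0.5):
    """Group tasks whose pairwise affinity exceeds a threshold (toy heuristic).

    relation: (T, T) symmetric matrix of task-affinity scores (made-up values).
    Returns a list of task-index groups via simple union-find merging.
    """
    T = relation.shape[0]
    parent = list(range(T))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(T):
        for j in range(i + 1, T):
            if relation[i, j] >= threshold:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i in range(T):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

toy = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
print(joint_training_groups(toy))  # [[0, 1], [2]]
```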
Citations: 0
MSFA Image Denoising Using Physics-Based Noise Model and Noise-Decoupled Network
IF 18.6 | Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610243
Yuqi Jiang;Ying Fu;Qiankun Liu;Jun Zhang
Multispectral filter array (MSFA) cameras are increasingly used due to their compact size and fast capturing speed. However, because of their narrow-band property, they often suffer from a light-deficiency problem, and the images captured are easily overwhelmed by noise. As a commonly used denoising approach, neural networks have shown their power to achieve satisfactory denoising results. However, their performance highly depends on high-quality noisy-clean image pairs. For the task of MSFA image denoising, there is currently neither a paired real dataset nor an accurate noise model capable of generating realistic noisy images. To this end, we present a physics-based noise model that is capable of matching the real noise distribution and synthesizing realistic noisy images. In our noise model, the different types of noise are divided into a SimpleDist component and a ComplexDist component. The former contains all the types of noise that can be described using a simple probability distribution, such as a Gaussian or Poisson distribution, and the latter contains the complicated color bias noise that cannot be modeled using a simple probability distribution. Besides, we design a noise-decoupled network consisting of a SimpleDist noise removal network (SNRNet) and a ComplexDist noise removal network (CNRNet) to sequentially remove each component. Moreover, according to the non-uniformity of the color bias noise in our noise model, we introduce a learnable position embedding in CNRNet to indicate the position information. To verify the effectiveness of our physics-based noise model and noise-decoupled network, we collect a real MSFA denoising dataset with paired long-exposure clean images and short-exposure noisy images. Experiments show that a network trained on synthetic data generated by our noise model performs as well as one trained on paired real data, and that our noise-decoupled network outperforms other state-of-the-art denoising methods.
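A minimal sketch of the SimpleDist idea: synthesizing a noisy measurement with a heteroscedastic Poisson-Gaussian model. The gain and read-noise values are assumptions, and the ComplexDist color-bias component is not modeled here.

```python
import numpy as np

def add_simpledist_noise(clean, gain=0.01, read_std=0.002, rng=None):
    """Apply shot (Poisson) plus read (Gaussian) noise to a clean image in [0, 1].

    gain:     photon-to-signal scale (assumed value).
    read_std: standard deviation of signal-independent read noise (assumed value).
    """
    rng = rng or np.random.default_rng(0)
    photons = rng.poisson(clean / gain)                      # signal-dependent shot noise
    noisy = photons * gain + rng.normal(0.0, read_std, clean.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((4, 4), 0.25)
print(add_simpledist_noise(clean).round(3))
```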
Citations: 0
PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation
IF 18.6 | Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610500
Jingjia Shi;Shuaifeng Zhi;Kai Xu
The challenging task of 3D planar reconstruction from images involves several sub-tasks, including frame-wise plane detection, segmentation, parameter regression, and possibly depth prediction, along with cross-frame plane correspondence and relative camera pose estimation. Previous works adopt a divide-and-conquer strategy, addressing the above sub-tasks with distinct network modules in a two-stage paradigm. Specifically, given an initial camera pose and per-frame plane predictions from the first stage, further exclusively designed modules relying on external plane correspondence labeling are applied to merge multi-view plane entities and produce a refined camera pose. Notably, existing work fails to integrate these closely related sub-tasks into a unified framework and instead addresses them separately and sequentially, which we identify as a primary source of performance limitations. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper we propose PlaneRecTR++, a Transformer-based architecture which, for the first time, unifies all tasks of multi-view planar reconstruction and pose estimation within a compact single-stage framework, eliminating the need for initial pose estimation and supervision of plane correspondence. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, achieving new state-of-the-art performance on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.
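A small sketch of the geometry that links plane parameters to depth: for a plane n·X = d in camera coordinates, the per-pixel z-depth follows in closed form from the camera intrinsics. The intrinsics and plane values below are made-up examples, unrelated to the PlaneRecTR++ prediction heads.

```python
import numpy as np

def plane_depth_map(K, n, d, height, width):
    """Per-pixel z-depth induced by the plane n . X = d in camera coordinates.

    K: (3, 3) camera intrinsics; n: (3,) plane normal; d: plane offset.
    """
    us, vs = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    rays = np.linalg.inv(K) @ pix          # back-projected rays with z = 1
    depth = d / (n @ rays)                 # solve n . (depth * ray) = d per pixel
    return depth.reshape(height, width)

K = np.array([[500.0, 0.0, 64.0],
              [0.0, 500.0, 48.0],
              [0.0, 0.0, 1.0]])            # made-up intrinsics
n = np.array([0.0, 0.0, 1.0])              # fronto-parallel plane
print(plane_depth_map(K, n, 2.0, 96, 128)[0, 0])  # 2.0 (constant depth)
```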
Citations: 0
Open-CRB: Toward Open World Active Learning for 3D Object Detection
IF 18.6 | Pub Date: 2025-09-12 | DOI: 10.1109/TPAMI.2025.3575756
Zhuoxiao Chen;Yadan Luo;Zixin Wang;Zijian Wang;Zi Huang
LiDAR-based 3D object detection has recently seen significant advancements through active learning (AL), attaining satisfactory performance by training on a small fraction of strategically selected point clouds. However, in real-world deployments where streaming point clouds may include unknown or novel objects, the ability of current AL methods to capture such objects remains unexplored. This paper investigates a more practical and challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aimed at acquiring informative point clouds containing new concepts. To tackle this challenge, we propose a simple yet effective strategy called Open Label Conciseness (OLC), which mines novel 3D objects with minimal annotation costs. Our empirical results show that OLC successfully adapts the 3D detection model to the open-world scenario with just a single round of selection. Any generic AL policy can then be integrated with the proposed OLC to efficiently address the OWAL-3D problem. Based on this, we introduce the Open-CRB framework, which seamlessly integrates OLC with our preliminary AL method, CRB, designed specifically for 3D object detection. We develop a comprehensive codebase for easy reproduction and future research, supporting 15 baseline methods (i.e., active learning, out-of-distribution detection, and open-world detection), 2 types of modern 3D detectors (i.e., one-stage SECOND and two-stage PV-RCNN), and 3 benchmark 3D datasets (i.e., KITTI, nuScenes, and Waymo). Extensive experiments show that the proposed Open-CRB demonstrates superiority and flexibility in recognizing both novel and known classes with very limited labeling costs, compared to state-of-the-art baselines.
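A generic active-learning selection loop for context; the acquisition score here is a plain uncertainty placeholder rather than the CRB or OLC criteria.

```python
import numpy as np

def select_for_labeling(scores, labeled_mask, budget):
    """Pick the `budget` highest-scoring unlabeled samples for annotation.

    scores:       (N,) acquisition scores (placeholder informativeness measure).
    labeled_mask: (N,) bool array, True where a sample is already labeled.
    """
    candidates = np.where(~labeled_mask)[0]
    order = candidates[np.argsort(scores[candidates])[::-1]]
    return order[:budget]

rng = np.random.default_rng(0)
N = 1000
scores = rng.random(N)                     # placeholder: any uncertainty/diversity score
labeled = np.zeros(N, dtype=bool)
for _ in range(3):                         # a few selection rounds
    picked = select_for_labeling(scores, labeled, budget=50)
    labeled[picked] = True                 # "annotate" the selected point clouds
    # In a real pipeline the detector would be retrained and `scores` recomputed here.
print(int(labeled.sum()))                  # 150 samples labeled after 3 rounds
```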
Citations: 0