
Latest articles from IEEE Transactions on Pattern Analysis and Machine Intelligence

Neural Eigenfunctions Are Structured Representation Learners.
IF 18.6 Pub Date : 2025-10-27 DOI: 10.1109/TPAMI.2025.3625728
Zhijie Deng, Jiaxin Shi, Hao Zhang, Peng Cui, Cewu Lu, Jun Zhu

This paper revisits the canonical concept of learning structured representations without label supervision by eigendecomposition. Yet, unlike prior spectral methods such as Laplacian Eigenmap which operate in a nonparametric manner, we aim to parametrically model the principal eigenfunctions of an integral operator defined by a kernel and a data distribution using a neural network for enhanced scalability and reasonable out-of-sample generalization. To achieve this goal, we first present a new series of objective functions that generalize the EigenGame [1] to function space for learning neural eigenfunctions. We then show that, when the similarity metric is derived from positive relations in a data augmentation setup, a representation learning objective function that resembles those of popular self-supervised learning methods emerges, with an additional symmetry-breaking property for producing structured representations where features are ordered by importance. We call such a structured, adaptive-length deep representation Neural Eigenmap. We demonstrate using Neural Eigenmap as adaptive-length codes in image retrieval systems. By truncation according to feature importance, our method requires up to $16\times$ shorter representation length than leading self-supervised learning ones to achieve similar retrieval performance. We further apply our method to graph data and report strong results on a node representation learning benchmark with more than one million nodes.

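The abstract gives no implementation details, but the retrieval use case it describes (codes whose leading dimensions are the most important, truncated to a length budget) can be illustrated with a minimal sketch; the function and variable names below are illustrative, not from the paper.

```python
import numpy as np

def truncated_retrieval(query_code, gallery_codes, k):
    """Rank gallery items using only the first k (most important) dimensions
    of importance-ordered codes such as Neural Eigenmaps."""
    q = query_code[:k]
    g = gallery_codes[:, :k]
    # cosine similarity on the truncated codes
    q = q / (np.linalg.norm(q) + 1e-12)
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    scores = g @ q
    return np.argsort(-scores)  # gallery indices ranked by similarity

# toy usage: 1000 gallery items with 256-d codes, keep only the first 16 dims
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 256))
query = rng.standard_normal(256)
ranking = truncated_retrieval(query, gallery, k=16)
```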
Citations: 0
Affine Correspondences between Multi-Camera Systems for Relative Pose Estimation.
IF 18.6 Pub Date : 2025-10-27 DOI: 10.1109/TPAMI.2025.3626134
Banglei Guan, Ji Zhao

We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to the multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede an efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution, when exploiting the geometric constraints between ACs and multi-camera systems using a special parameterization. We present a problem formulation based on two ACs that encompass two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem-where the scale transformation of multi-camera systems is unknown-we propose 7DOF solvers to compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems. These include 5DOF solvers with known relative rotation angle, and 4DOF solver with known vertical direction. Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than the state-of-the-art algorithms, while resulting in a better relative pose accuracy. Source code is available at https://github.com/jizhaox/relpose-mcs-depth.

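The practical payoff of a two-AC minimal solver inside RANSAC follows from the standard relation between minimal-sample size and the number of iterations required; the short sketch below evaluates that textbook bound for a few illustrative sample sizes (the alternative sample sizes are assumptions for comparison, not figures from the paper).

```python
import math

def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Standard RANSAC bound: number of random samples needed so that, with the
    given confidence, at least one sample is drawn entirely from inliers."""
    p_all_inlier = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inlier))

# Smaller minimal samples need far fewer iterations at the same inlier ratio.
for sample_size in (2, 6, 17):  # e.g. 2 ACs vs. larger point-based samples (illustrative)
    print(sample_size, ransac_iterations(inlier_ratio=0.5, sample_size=sample_size))
```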
Citations: 0
High-resolution open-vocabulary object 6D pose estimation.
IF 18.6 Pub Date : 2025-10-23 DOI: 10.1109/TPAMI.2025.3624589
Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi

The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.

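The abstract states that cross-scene matches are extracted from high-resolution features for registration; a common generic way to obtain such matches, not necessarily the paper's exact procedure, is mutual nearest-neighbour matching on normalised descriptors, sketched here in PyTorch.

```python
import torch
import torch.nn.functional as F

def mutual_nn_matches(feats_a, feats_b):
    """Return index pairs (i, j) where descriptor i of scene A and descriptor j
    of scene B are each other's nearest neighbour in cosine similarity."""
    a = F.normalize(feats_a, dim=1)          # (Na, C)
    b = F.normalize(feats_b, dim=1)          # (Nb, C)
    sim = a @ b.t()                          # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(dim=1)                # best j for every i
    nn_ba = sim.argmax(dim=0)                # best i for every j
    i = torch.arange(a.shape[0])
    mutual = nn_ba[nn_ab] == i               # keep only mutual agreements
    return torch.stack([i[mutual], nn_ab[mutual]], dim=1)

# toy usage with random 256-d descriptors from the two scenes
matches = mutual_nn_matches(torch.randn(500, 256), torch.randn(480, 256))
```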
Citations: 0
SMC++: Masked Learning of Unsupervised Video Semantic Compression.
IF 18.6 Pub Date : 2025-10-23 DOI: 10.1109/TPAMI.2025.3625063
Yuan Tian, Xiaoyue Ling, Cong Geng, Qiang Hu, Guo Lu, Guangtao Zha

Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during the compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information like trivial textural details, wasting bit cost and introducing semantic noise. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC as an advanced SMC++ model from several aspects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning ability. Second, we introduce a Transformer-based compression module, to improve the semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate that the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs, on three video analysis tasks and seven datasets.

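For context on the masked-patch-prediction objective mentioned above, here is a minimal sketch of random masking over video patch tokens; it is a generic MVM-style mask, not the paper's specific sampling or entropy-regularisation scheme.

```python
import torch

def random_token_mask(batch, num_tokens, mask_ratio=0.75):
    """Per-sample boolean mask over patch tokens: True = masked (to be predicted)."""
    num_masked = int(num_tokens * mask_ratio)
    noise = torch.rand(batch, num_tokens)
    ranks = noise.argsort(dim=1).argsort(dim=1)   # random permutation ranks per sample
    return ranks < num_masked                     # exactly num_masked True entries per row

# e.g. 8 frames x 14 x 14 patches per frame = 1568 tokens per clip
mask = random_token_mask(batch=4, num_tokens=8 * 14 * 14, mask_ratio=0.9)
```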
Citations: 0
Examining the Impact of Optical Aberrations to Image Classification and Object Detection Models.
IF 18.6 Pub Date : 2025-10-16 DOI: 10.1109/TPAMI.2025.3622234
Patrick Muller, Alexander Braun, Margret Keuper
Deep neural networks (DNNs) have proven to be successful in various computer vision applications such that, in many cases, models even infer in safety-critical situations. Therefore, vision models have to behave in a robust way to disturbances such as noise or blur. While seminal benchmarks exist to evaluate model robustness to diverse common corruptions, blur is often approximated in an overly simplistic way to model defocus, while ignoring the different blur kernel shapes that result from optical systems. To study model robustness against realistic optical blur effects, this paper proposes two datasets of blur corruptions, which we denote OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such as coma, defocus, and astigmatism, i.e. aberrations that can be represented by varying a single parameter of Zernike polynomials. To go beyond the principled but synthetic setting of primary aberrations, LensCorruptions samples linear combinations in the vector space spanned by Zernike polynomials, corresponding to 100 real lenses with diverse aberrations, qualities, and types. Evaluations for image classification and object detection on ImageNet and MSCOCO show that for a variety of different pre-trained models, the performance on OpticsBench and LensCorruptions varies significantly, indicating the need to consider realistic image corruptions to evaluate a model's robustness against blur. In addition, we show on ImageNet-100 with our OpticsAugment framework that robustness can be increased by using optical kernels as data augmentation. Compared to a conventionally trained ResNeXt50, training with OpticsAugment yields an average performance gain of 21.7% on OpticsBench and 6.8% on 2D common corruptions.
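The aberrations in OpticsBench are parameterised by single Zernike terms; the sketch below follows the standard Fourier-optics route from a pure defocus term to a blur kernel (PSF) that could be convolved with an image. Sampling and normalisation details are simplified assumptions rather than the benchmark's exact pipeline.

```python
import numpy as np

def defocus_psf(size=64, coeff_waves=0.5):
    """PSF of a circular pupil with a pure defocus aberration (Zernike Z_2^0),
    computed as |FFT(pupil * exp(i * 2*pi * W))|^2 with simplified scaling."""
    y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
    rho = np.sqrt(x**2 + y**2)
    pupil = (rho <= 1.0).astype(float)
    zernike_defocus = np.sqrt(3.0) * (2.0 * rho**2 - 1.0)   # Z_2^0 (Noll index 4)
    wavefront = coeff_waves * zernike_defocus                # wavefront error in waves
    field = pupil * np.exp(1j * 2.0 * np.pi * wavefront)
    psf = np.abs(np.fft.fftshift(np.fft.fft2(field)))**2
    return psf / psf.sum()                                   # normalise to unit energy

kernel = defocus_psf()   # convolve this kernel with an image to simulate defocus blur
```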
Citations: 0
Data-And Knowledge-Driven Visual Abductive Reasoning
IF 18.6 Pub Date : 2025-09-23 DOI: 10.1109/TPAMI.2025.3613712
Chen Liang;Wenguan Wang;Ling Chen;Yi Yang
Abductive reasoning seeks the likeliest possible explanation for partial observations. Although frequently employed in everyday human reasoning, abduction is rarely explored in the computer vision literature. In this article, we propose a new task, Visual Abductive Reasoning (VAR), that underpins the machine intelligence study of abductive reasoning in everyday visual situations. Given an incomplete set of visual events, AI systems are required to not only describe what is observed, but also infer the hypothesis that can best explain the observed premise. We create the first large-scale VAR dataset, which contains a total of 9K examples. We further devise a transformer-based VAR model – Reasonerv2 – for knowledge-driven, causal-and-cascaded reasoning. Reasonerv2 first adopts a contextualized directional position embedding strategy in the encoder, to capture the causal-related temporal structure of the observations, and yield discriminative representations for the premises and hypotheses. Then, Reasonerv2 extracts condensed causal knowledge from external knowledge bases, for reasoning beyond observation. Finally, Reasonerv2 cascades multiple decoders so as to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that Reasonerv2 surpasses many famous video-language models, while still being far behind human performance.
Citations: 0
MGAF: LiDAR-Camera 3D Object Detection With Multiple Guidance and Adaptive Fusion
IF 18.6 Pub Date : 2025-09-22 DOI: 10.1109/TPAMI.2025.3612958
Baojie Fan;Xiaotian Li;Yuhan Zhou;Caixia Xia;Huijie Fan;Fengyu Xu;Jiandong Tian
Recent years have witnessed the remarkable progress of 3D multi-modality object detection methods based on the Bird’s-Eye-View (BEV) perspective. However, most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work, we propose a novel multi-modality 3D objection detection method, with multi-guided global interaction and LiDAR-guided adaptive fusion, named MGAF. Specifically, we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth and spatial information. The designed semantic segmentation network captures category and orientation prior information for raw point clouds. In the following, an Adaptive Fusion Dual Transformer (AFDT) is developed to adaptively enhance the interaction of different modal BEV features from both global and bidirectional perspectives. Meanwhile, additional downsampling with sparse height compression and multi-scale dual-path transformer (MSDPT) are designed in order to enlarge the receptive fields of different modal features. Finally, a temporal fusion module is introduced to aggregate features from previous frames. Notably, the proposed AFDT is general, which also shows superior performance on other models. Our framework has undergone extensive experimentation on the large-scale nuScenes dataset, Waymo Open Dataset, and long-range Argoverse2 dataset, consistently demonstrating state-of-the-art performance.
Code will be released at https://github.com/xioatian1/MGAF. Index terms: 3D object detection, multi-modality, multiple guidance, adaptive fusion, BEV representation, autonomous driving.
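As background for the BEV representation used above, the following sketch rasterises a LiDAR point cloud into a bird's-eye-view occupancy grid; this is a generic pillar-style step with illustrative grid extents, not MGAF's LiDAR occupancy guidance module.

```python
import torch

def lidar_to_bev_occupancy(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), res=0.5):
    """points: (N, 3+) tensor of x, y, z[, ...]; returns an (H, W) BEV occupancy grid."""
    H = int((y_range[1] - y_range[0]) / res)
    W = int((x_range[1] - x_range[0]) / res)
    xs = ((points[:, 0] - x_range[0]) / res).long()
    ys = ((points[:, 1] - y_range[0]) / res).long()
    keep = (xs >= 0) & (xs < W) & (ys >= 0) & (ys < H)
    bev = torch.zeros(H, W)
    bev[ys[keep], xs[keep]] = 1.0     # mark cells containing at least one point
    return bev

bev = lidar_to_bev_occupancy(torch.randn(10000, 4) * 20)   # toy point cloud
```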
Citations: 0
Toward Optimal Mixture of Experts System for 3D Object Detection: A Game of Accuracy, Efficiency and Adaptivity
IF 18.6 Pub Date : 2025-09-22 DOI: 10.1109/TPAMI.2025.3611795
Linshen Liu;Pu Wang;Guanlin Wu;Junyue Jiang;Hao Frank Yang
Autonomous vehicles, open-world robots, and other automated systems rely on accurate, efficient perception modules for real-time object detection. Although high-precision models improve reliability, their processing time and computational overhead can hinder real-time performance and raise safety concerns. This paper introduces an Edge-based Mixture-of-Experts Optimal Sensing (EMOS) System that addresses the challenge of co-achieving accuracy, latency and scene adaptivity, further demonstrated in the open-world autonomous driving scenarios. Algorithmically, EMOS fuses multimodal sensor streams via an Adaptive Multimodal Data Bridge and uses a scenario-aware MoE switch to activate only a complementary set of specialized experts as needed. The proposed hierarchical backpropagation and a multiscale pooling layer let model capacity scale with real-world demand complexity. System-wise, an edge-optimized runtime with accelerator-aware scheduling (e.g., ONNX/TensorRT), zero-copy buffering, and overlapped I/O–compute enforces explicit latency/accuracy budgets across diverse driving conditions. Experimental results establish EMOS as the new state of the art: on KITTI, it increases average AP by 3.17% while running $2.6\times$ faster on Nvidia Jetson. On nuScenes, it improves accuracy by 0.2% mAP and 0.5% NDS, with 34% fewer parameters and a $15.35\times$ Nvidia Jetson speedup. Leveraging multimodal data and intelligent expert cooperation, EMOS delivers an accurate, efficient and edge-adaptive perception system for autonomous vehicles, thereby ensuring robust, timely responses in real-world scenarios.
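The system's core idea is a gate that activates only a complementary subset of experts per scene; a minimal top-k mixture-of-experts layer in PyTorch is sketched below, with placeholder linear experts and gate rather than the EMOS architecture.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: a gate scores all experts,
    only the k best run per sample, and their outputs are blended."""
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                        # x: (B, dim) scene features
        scores = self.gate(x).softmax(dim=-1)    # (B, num_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalise selected weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = topi[:, slot] == e         # samples routed to expert e in this slot
                if sel.any():
                    out[sel] += topv[sel, slot].unsqueeze(1) * expert(x[sel])
        return out

y = TopKMoE(dim=128)(torch.randn(8, 128))
```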
Citations: 0
Pathway-Aware Multimodal Transformer (PAMT): Integrating Pathological Image and Gene Expression for Interpretable Cancer Survival Analysis
IF 18.6 Pub Date : 2025-09-18 DOI: 10.1109/TPAMI.2025.3611531
Rui Yan;Xueyuan Zhang;Zihang Jiang;Baizhi Wang;Xiuwu Bian;Fei Ren;S. Kevin Zhou
Integrating multimodal data of pathological image and gene expression for cancer survival analysis can achieve better results than using a single modality. However, existing multimodal learning methods ignore fine-grained interactions between both modalities, especially the interactions between biological pathways and pathological image patches. In this article, we propose a novel Pathway-Aware Multimodal Transformer (PAMT) framework for interpretable cancer survival analysis. Specifically, the PAMT learns fine-grained modality interaction through three stages: (1) In the intra-modal pathway-pathway / patch-patch interaction stage, we use the Transformer model to perform intra-modal information interaction; (2) In the inter-modal pathway-patch alignment stage, we introduce a novel label-free contrastive loss to align semantic information between different modalities so that the features of the two modalities are mapped to the same semantic space; and (3) In the inter-modal pathway-patch fusion stage, to model the medical prior knowledge of "genotype determines phenotype", we propose a pathway-to-patch cross fusion module to perform inter-modal information interaction under the guidance of the pathway prior. In addition, the inter-modal cross fusion module of PAMT offers good interpretability, helping a pathologist to screen which pathway plays a key role, to locate which regions of the whole slide image (WSI) are affected by the pathway, and to mine prognosis-relevant pathology image patterns. Experimental results based on three datasets of bladder urothelial carcinoma, lung squamous cell carcinoma, and lung adenocarcinoma demonstrate that the proposed framework significantly outperforms the state-of-the-art methods.
In addition, based on the PAMT model, we developed a website, available at http://222.128.10.254:18822/#/, that directly visualizes the impact of 186 pathways on all regions of the WSI.
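The pathway-to-patch fusion described above is a form of cross-attention with pathway tokens as queries and WSI patch tokens as keys and values; a generic sketch with torch.nn.MultiheadAttention follows. Dimensions are illustrative, and this is not the paper's exact module (only the 186-pathway count is taken from the entry above).

```python
import torch
import torch.nn as nn

dim = 256
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

pathway_tokens = torch.randn(1, 186, dim)   # one query token per biological pathway
patch_tokens = torch.randn(1, 4096, dim)    # WSI patch embeddings (keys and values)

# Pathways attend over patches: `fused` holds one image-conditioned token per
# pathway, and `attn` indicates which patches each pathway relied on.
fused, attn = cross_attn(pathway_tokens, patch_tokens, patch_tokens)
```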
Citations: 0
SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection
IF 18.6 Pub Date : 2025-09-18 DOI: 10.1109/TPAMI.2025.3611519
Dingkang Liang;Wei Hua;Chunsheng Shi;Zhikang Zou;Xiaoqing Ye;Xiang Bai
Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving oriented objects common in aerial images unexplored. At the same time, the annotation cost of oriented objects is significantly higher than that of their horizontal counterparts (an approximate 36.5% increase in costs). Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images usually have arbitrary orientations, small scales, and dense distribution, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V2.0/DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.90/2.14, +2.16/2.18, and +2.66/2.32) mAP under 10%, 20%, and 30% labeled data settings, respectively, with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. Moreover, our method demonstrates stable generalization ability across different oriented detectors, even for multi-view oriented 3D object detectors.
Code will be made available.
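For context on the teacher-student pipeline, the sketch below shows the generic step of keeping only high-confidence teacher detections on unlabeled images as pseudo-labels; the threshold and box format are assumptions, and the paper's dense instance-aware sampling and geometry-aware weighting are not reproduced here.

```python
import torch

def filter_pseudo_labels(teacher_boxes, teacher_scores, score_thresh=0.7):
    """Keep teacher detections above a confidence threshold as pseudo-labels.
    teacher_boxes: (N, 5) oriented boxes (cx, cy, w, h, angle); teacher_scores: (N,)."""
    keep = teacher_scores >= score_thresh
    return teacher_boxes[keep], teacher_scores[keep]

# toy usage: the student would then be supervised on (pseudo_boxes, pseudo_scores)
boxes = torch.rand(100, 5)
scores = torch.rand(100)
pseudo_boxes, pseudo_scores = filter_pseudo_labels(boxes, scores)
```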
Citations: 0