Neural Eigenfunctions Are Structured Representation Learners
Pub Date : 2025-10-27  DOI: 10.1109/TPAMI.2025.3625728
Zhijie Deng, Jiaxin Shi, Hao Zhang, Peng Cui, Cewu Lu, Jun Zhu
This paper revisits the canonical concept of learning structured representations without label supervision by eigendecomposition. Yet, unlike prior spectral methods such as Laplacian Eigenmaps, which operate in a nonparametric manner, we aim to parametrically model the principal eigenfunctions of an integral operator, defined by a kernel and a data distribution, using a neural network, for enhanced scalability and reasonable out-of-sample generalization. To achieve this goal, we first present a new series of objective functions that generalize the EigenGame [1] to function space for learning neural eigenfunctions. We then show that, when the similarity metric is derived from positive relations in a data augmentation setup, a representation learning objective emerges that resembles those of popular self-supervised learning methods, with an additional symmetry-breaking property that produces structured representations whose features are ordered by importance. We call such a structured, adaptive-length deep representation a Neural Eigenmap. We demonstrate the use of Neural Eigenmaps as adaptive-length codes in image retrieval systems. By truncating representations according to feature importance, our method requires up to $16\times$ shorter representation length than leading self-supervised learning methods to achieve similar retrieval performance. We further apply our method to graph data and report strong results on a node representation learning benchmark with more than one million nodes.
Affine Correspondences between Multi-Camera Systems for Relative Pose Estimation
Pub Date : 2025-10-27  DOI: 10.1109/TPAMI.2025.3626134
Banglei Guan, Ji Zhao
We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). These solvers therefore impede efficient or accurate relative pose estimation when RANSAC is applied as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution when the geometric constraints between ACs and multi-camera systems are exploited through a special parameterization. We present a problem formulation based on two ACs that encompasses the two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we develop a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use the framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem, where the scale transformation of the multi-camera system is unknown, we propose 7DOF solvers that compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems, including 5DOF solvers with a known relative rotation angle and a 4DOF solver with a known vertical direction. Experiments on both virtual and real multi-camera systems show that the proposed solvers are more efficient than state-of-the-art algorithms while achieving better relative pose accuracy. Source code is available at https://github.com/jizhaox/relpose-mcs-depth.
{"title":"Affine Correspondences between Multi-Camera Systems for Relative Pose Estimation.","authors":"Banglei Guan, Ji Zhao","doi":"10.1109/TPAMI.2025.3626134","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3626134","url":null,"abstract":"<p><p>We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to the multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede an efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution, when exploiting the geometric constraints between ACs and multi-camera systems using a special parameterization. We present a problem formulation based on two ACs that encompass two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem-where the scale transformation of multi-camera systems is unknown-we propose 7DOF solvers to compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems. These include 5DOF solvers with known relative rotation angle, and 4DOF solver with known vertical direction. Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than the state-of-the-art algorithms, while resulting in a better relative pose accuracy. Source code is available at https://github.com/jizhaox/relpose-mcs-depth.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-resolution open-vocabulary object 6D pose estimation
Pub Date : 2025-10-23  DOI: 10.1109/TPAMI.2025.3624589
Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi
Generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable the use of natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object described only by a textual prompt. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features, which are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.
{"title":"High-resolution open-vocabulary object 6D pose estimation.","authors":"Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi","doi":"10.1109/TPAMI.2025.3624589","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3624589","url":null,"abstract":"<p><p>The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota Light, Linemod, and YCB-Video. Our method achieves state-of the-art performance on all datasets, outperforming by 12.6 in Average Recall the previous best-performing approach.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145357313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SMC++: Masked Learning of Unsupervised Video Semantic Compression
Pub Date : 2025-10-23  DOI: 10.1109/TPAMI.2025.3625063
Yuan Tian, Xiaoyue Ling, Cong Geng, Qiang Hu, Guo Lu, Guangtao Zha
Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that specifically preserves video semantics by jointly mining and compressing them in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information such as trivial textural details, wasting bitcost and introducing semantic noise. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC to an advanced SMC++ model in several respects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning. Second, we introduce a Transformer-based compression module to improve semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features at different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate that the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs on three video analysis tasks and seven datasets.
Examining the Impact of Optical Aberrations to Image Classification and Object Detection Models
Pub Date : 2025-10-16  DOI: 10.1109/TPAMI.2025.3622234
Patrick Muller, Alexander Braun, Margret Keuper
Deep neural networks (DNNs) have proven to be successful in various computer vision applications such that, in many cases, models even infer in safety-critical situations. Therefore, vision models have to behave robustly under disturbances such as noise or blur. While seminal benchmarks exist to evaluate model robustness to diverse common corruptions, blur is often approximated in an overly simplistic way to model defocus, ignoring the different blur kernel shapes that result from optical systems. To study model robustness against realistic optical blur effects, this paper proposes two datasets of blur corruptions, which we denote OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such as coma, defocus, and astigmatism, i.e., aberrations that can be represented by varying a single parameter of Zernike polynomials. To go beyond the principled but synthetic setting of primary aberrations, LensCorruptions samples linear combinations in the vector space spanned by Zernike polynomials, corresponding to 100 real lenses with diverse aberrations, qualities, and types. Evaluations for image classification and object detection on ImageNet and MSCOCO show that for a variety of different pre-trained models, the performance on OpticsBench and LensCorruptions varies significantly, indicating the need to consider realistic image corruptions to evaluate a model's robustness against blur. In addition, we show on ImageNet-100 with our OpticsAugment framework that robustness can be increased by using optical kernels as data augmentation. Compared to a conventionally trained ResNeXt50, training wit
{"title":"Examining the Impact of Optical Aberrations to Image Classification and Object Detection Models.","authors":"Patrick Muller, Alexander Braun, Margret Keuper","doi":"10.1109/TPAMI.2025.3622234","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3622234","url":null,"abstract":"<p><p>Deep neural networks (DNNs) have proven to be successful in various computer vision applications such that models even infer in safety-critical situations. Therefore, vision models have to behave in a robust way to disturbances such as noise or blur. While seminal benchmarks exist to evaluate model robustness to diverse corruptions, blur is often approximated in an overly simplistic way to model defocus, while ignoring the different blur kernel shapes that result from optical systems. To study model robustness against realistic optical blur effects, this paper proposes two datasets of blur corruptions, which we denote OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such as coma, defocus, and astigmatism, i.e. aberrations that can be represented by varying a single parameter of Zernike polynomials. To go beyond the principled but synthetic setting of primary aberrations, LensCorruptions samples linear combinations in the vector space spanned by Zernike polynomials, corresponding to 100 real lenses. Evaluations for image classification and object detection on ImageNet and MSCOCO show that for a variety of different pre-trained models, the performance on OpticsBench and LensCorruptions varies significantly, indicating the need to consider realistic image corruptions to evaluate a model's robustness against blur. Deep neural networks (DNNs) have proven to be successful in various computer vision applications such that, in many cases, models even infer in safety-critical situations. Therefore, vision models have to behave in a robust way to disturbances such as noise or blur. While seminal benchmarks exist to evaluate model robustness to diverse common corruptions, blur is often approximated in an overly simplistic way to model defocus, while ignoring the different blur kernel shapes that result from optical systems. To study model robustness against realistic optical blur effects, this paper proposes two datasets of blur corruptions, which we denote OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such as coma, defocus, and astigmatism, i.e. aberrations that can be represented by varying a single parameter of Zernike polynomials. To go beyond the principled but synthetic setting of primary aberrations, LensCorruptions samples linear combinations in the vector space spanned by Zernike polynomials, corresponding to 100 real lenses with diverse aberrations, qualities, and types. Evaluations for image classification and object detection on ImageNet and MSCOCO show that for a variety of different pre-trained models, the performance on OpticsBench and LensCorruptions varies significantly, indicating the need to consider realistic image corruptions to evaluate a model's robustness against blur. In addition, we show on ImageNet-100 with our OpticsAugment framework that robustness can be increased by using optical kernels as data augmentation. 
Compared to a conventionally trained ResNeXt50, training wit","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145310346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
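The data-augmentation use of optical kernels can be sketched as a single depthwise convolution of each training image with a point-spread function (PSF). How the PSF is obtained from Zernike coefficients is omitted; psf is assumed to be a pre-computed, normalized (k, k) kernel with odd k, and this is an illustration rather than the OpticsAugment code:

import torch
import torch.nn.functional as F

def apply_psf(images, psf):
    # images: (B, C, H, W) float tensor; psf: (k, k) kernel summing to 1, k odd
    c, k = images.shape[1], psf.shape[-1]
    kernel = psf.expand(c, 1, k, k)                                   # one copy of the PSF per channel
    return F.conv2d(images, kernel, padding=k // 2, groups=c)         # depthwise blur, spatial size preserved

# typical usage: blur a batch with some probability during training
# images = apply_psf(images, psf) if torch.rand(()) < 0.5 else images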
Data-And Knowledge-Driven Visual Abductive Reasoning
Pub Date : 2025-09-23  DOI: 10.1109/TPAMI.2025.3613712
Chen Liang;Wenguan Wang;Ling Chen;Yi Yang
Abductive reasoning seeks the likeliest possible explanation for partial observations. Although frequently employed in everyday human reasoning, abduction is rarely explored in the computer vision literature. In this article, we propose a new task, Visual Abductive Reasoning (VAR), that underpins the machine intelligence study of abductive reasoning in everyday visual situations. Given an incomplete set of visual events, AI systems are required not only to describe what is observed, but also to infer the hypothesis that can best explain the observed premise. We create the first large-scale VAR dataset, which contains a total of 9K examples. We further devise a transformer-based VAR model, Reasonerv2, for knowledge-driven, causal-and-cascaded reasoning. Reasonerv2 first adopts a contextualized directional position embedding strategy in the encoder to capture the causal-related temporal structure of the observations and yield discriminative representations for the premises and hypotheses. Then, Reasonerv2 extracts condensed causal knowledge from external knowledge bases, for reasoning beyond observation. Finally, Reasonerv2 cascades multiple decoders so as to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that Reasonerv2 surpasses many famous video-language models, while still falling far behind human performance.
{"title":"Data-And Knowledge-Driven Visual Abductive Reasoning","authors":"Chen Liang;Wenguan Wang;Ling Chen;Yi Yang","doi":"10.1109/TPAMI.2025.3613712","DOIUrl":"10.1109/TPAMI.2025.3613712","url":null,"abstract":"Abductive reasoning seeks the likeliest possible explanation for partial observations. Although being frequently employed in human daily reasoning, abduction is rarely explored in computer vision literature. In this article, we propose a new task, Visual Abductive Reasoning (VAR), that underpins the machine intelligence study of abductive reasoning in everyday visual situations. Given an incomplete set of visual events, AI systems are required to not only describe what is observed, but also infer the hypothesis that can best explain the observed premise. We create the first large-scale VAR dataset, which contains a total of 9K examples. We further devise a transformer-based VAR model – <sc>Reasoner</small>v2 – for knowledge-driven, causal-and-cascaded reasoning. <sc>Reasoner</small>v2 first adopts a contextualized directional position embedding strategy in the encoder, to capture the causal-related temporal structure of the observations, and yield discriminative representations for the premises and hypotheses. Then, <sc>Reasoner</small>v2 extracts condensed causal knowledge from external knowledge bases, for reasoning beyond observation. Finally, <sc>Reasoner</small>v2 cascades multiple decoders so as to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that <sc>Reasoner</small>v2 surpasses many famous video-language models, while still being far behind human performance.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"792-806"},"PeriodicalIF":18.6,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145127459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MGAF: LiDAR-Camera 3D Object Detection With Multiple Guidance and Adaptive Fusion
Pub Date : 2025-09-22  DOI: 10.1109/TPAMI.2025.3612958
Baojie Fan;Xiaotian Li;Yuhan Zhou;Caixia Xia;Huijie Fan;Fengyu Xu;Jiandong Tian
Recent years have witnessed the remarkable progress of 3D multi-modality object detection methods based on the Bird’s-Eye-View (BEV) perspective. However, most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work, we propose a novel multi-modality 3D object detection method with multi-guided global interaction and LiDAR-guided adaptive fusion, named MGAF. Specifically, we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth and spatial information. The designed semantic segmentation network captures category and orientation prior information for raw point clouds. Subsequently, an Adaptive Fusion Dual Transformer (AFDT) is developed to adaptively enhance the interaction of different modal BEV features from both global and bidirectional perspectives. Meanwhile, additional downsampling with sparse height compression and a multi-scale dual-path transformer (MSDPT) are designed to enlarge the receptive fields of different modal features. Finally, a temporal fusion module is introduced to aggregate features from previous frames. Notably, the proposed AFDT is general and also shows superior performance on other models. Our framework has undergone extensive experimentation on the large-scale nuScenes dataset, Waymo Open Dataset, and long-range Argoverse2 dataset, consistently demonstrating state-of-the-art performance.
Toward Optimal Mixture of Experts System for 3D Object Detection: A Game of Accuracy, Efficiency and Adaptivity
Pub Date : 2025-09-22  DOI: 10.1109/TPAMI.2025.3611795
Linshen Liu;Pu Wang;Guanlin Wu;Junyue Jiang;Hao Frank Yang
Autonomous vehicles, open-world robots, and other automated systems rely on accurate, efficient perception modules for real-time object detection. Although high-precision models improve reliability, their processing time and computational overhead can hinder real-time performance and raise safety concerns. This paper introduces an Edge-based Mixture-of-Experts Optimal Sensing (EMOS) system that addresses the challenge of jointly achieving high accuracy, low latency, and scene adaptivity, demonstrated in open-world autonomous driving scenarios. Algorithmically, EMOS fuses multimodal sensor streams via an Adaptive Multimodal Data Bridge and uses a scenario-aware MoE switch to activate only a complementary set of specialized experts as needed. The proposed hierarchical backpropagation and a multiscale pooling layer let model capacity scale with real-world demand complexity. System-wise, an edge-optimized runtime with accelerator-aware scheduling (e.g., ONNX/TensorRT), zero-copy buffering, and overlapped I/O and compute enforces explicit latency/accuracy budgets across diverse driving conditions. Experimental results establish EMOS as the new state of the art: on KITTI, it increases average AP by 3.17% while running $2.6\times$ faster on an Nvidia Jetson. On nuScenes, it improves accuracy by 0.2% mAP and 0.5% NDS, with 34% fewer parameters and a $15.35\times$ Nvidia Jetson speedup. Leveraging multimodal data and intelligent expert cooperation, EMOS delivers an accurate, efficient, and edge-adaptive perception system for autonomous vehicles, ensuring robust, timely responses in real-world scenarios.
{"title":"Toward Optimal Mixture of Experts System for 3D Object Detection: A Game of Accuracy, Efficiency and Adaptivity","authors":"Linshen Liu;Pu Wang;Guanlin Wu;Junyue Jiang;Hao Frank Yang","doi":"10.1109/TPAMI.2025.3611795","DOIUrl":"10.1109/TPAMI.2025.3611795","url":null,"abstract":"Autonomous vehicles, open-world robots, and other automated systems rely on accurate, efficient perception modules for real-time object detection. Although high-precision models improve reliability, their processing time and computational overhead can hinder real-time performance and raise safety concerns. This paper introduces an <i>Edge-based Mixture-of-Experts Optimal Sensing</i> (<i>EMOS</i>) System that addresses the challenge of co-achieving accuracy, latency and scene adaptivity, further demonstrated in the open-world autonomous driving scenarios. Algorithmically, <i>EMOS</i> fuses multimodal sensor streams via an Adaptive Multimodal Data Bridge and uses a scenario-aware MoE switch to activate only a complementary set of specialized experts as needed. The proposed hierarchical backpropagation and a multiscale pooling layer let model capacity scale with real-world demand complexity. System-wise, an edge-optimized runtime with accelerator-aware scheduling (e.g., ONNX/TensorRT), zero-copy buffering, and overlapped I/O–compute enforces explicit latency/accuracy budgets across diverse driving conditions. Experimental results establish <i>EMOS</i> as the new state of the art: on KITTI, it increases average AP by 3.17% while running <inline-formula><tex-math>$2.6times$</tex-math></inline-formula> faster on Nvidia Jetson. On nuScenes, it improves accuracy by 0.2% mAP and 0.5% NDS, with 34% fewer parameters and a <inline-formula><tex-math>$15.35times$</tex-math></inline-formula> Nvidia Jetson speedup. Leveraging multimodal data and intelligent experts cooperation, <i>EMOS</i> delivers accurate, efficient and edge-adaptive perception system for autonomous vehicles, thereby ensuring robust, timely responses in real-world scenarios.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"914-931"},"PeriodicalIF":18.6,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145117071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pathway-Aware Multimodal Transformer (PAMT): Integrating Pathological Image and Gene Expression for Interpretable Cancer Survival Analysis
Pub Date : 2025-09-18  DOI: 10.1109/TPAMI.2025.3611531
Rui Yan;Xueyuan Zhang;Zihang Jiang;Baizhi Wang;Xiuwu Bian;Fei Ren;S. Kevin Zhou
Integrating multimodal data of pathological images and gene expression for cancer survival analysis can achieve better results than using a single modality. However, existing multimodal learning methods ignore fine-grained interactions between the two modalities, especially the interactions between biological pathways and pathological image patches. In this article, we propose a novel Pathway-Aware Multimodal Transformer (PAMT) framework for interpretable cancer survival analysis. Specifically, PAMT learns fine-grained modality interaction through three stages: (1) in the intra-modal pathway-pathway / patch-patch interaction stage, we use a Transformer to perform intra-modal information interaction; (2) in the inter-modal pathway-patch alignment stage, we introduce a novel label-free contrastive loss that aligns semantic information between the modalities so that the features of the two modalities are mapped to the same semantic space; and (3) in the inter-modal pathway-patch fusion stage, to model the medical prior knowledge that “genotype determines phenotype”, we propose a pathway-to-patch cross fusion module that performs inter-modal information interaction under the guidance of the pathway prior. In addition, the inter-modal cross fusion module endows PAMT with good interpretability, helping a pathologist to screen which pathway plays a key role, to locate which regions of the whole slide image (WSI) are affected by that pathway, and to mine prognosis-relevant pathology image patterns. Experimental results on three datasets of bladder urothelial carcinoma, lung squamous cell carcinoma, and lung adenocarcinoma demonstrate that the proposed framework significantly outperforms state-of-the-art methods.
{"title":"Pathway-Aware Multimodal Transformer (PAMT): Integrating Pathological Image and Gene Expression for Interpretable Cancer Survival Analysis","authors":"Rui Yan;Xueyuan Zhang;Zihang Jiang;Baizhi Wang;Xiuwu Bian;Fei Ren;S. Kevin Zhou","doi":"10.1109/TPAMI.2025.3611531","DOIUrl":"10.1109/TPAMI.2025.3611531","url":null,"abstract":"Integrating multimodal data of pathological image and gene expression for cancer survival analysis can achieve better results than using a single modality. However, existing multimodal learning methods ignore fine-grained interactions between both modalities, especially the interactions between biological pathways and pathological image patches. In this article, we propose a novel Pathway-Aware Multimodal Transformer (PAMT) framework for interpretable cancer survival analysis. Specifically, the PAMT learns fine-grained modality interaction through three stages: (1) In the intra-modal <italic>pathway-pathway / patch-patch</i> interaction stage, we use the Transformer model to perform intra-modal information interaction; (2) In the inter-modal <italic>pathway-patch</i> alignment stage, we introduce a novel label-free contrastive loss to aligns semantic information between different modalities so that the features of the two modalities are mapped to the same semantic space; and (3) In the inter-modal <italic>pathway-patch</i> fusion stage, to model the medical prior knowledge of “genotype determines phenotype”, we propose a pathway-to-patch cross fusion module to perform inter-modal information interaction under the guidance of pathway prior. In addition, the inter-modal cross fusion module of PAMT endows good interpretability, helping a pathologist to <italic>screen</i> which pathway plays a key role, to <italic>locate</i> where on whole slide image (WSI) are affected by the pathway, and to <italic>mine</i> prognosis-relevant pathology image patterns. Experimental results based on three datasets of bladder urothelial carcinoma, lung squamous cell carcinoma, and lung adenocarcinoma demonstrate that the proposed framework significantly outperforms the state-of-the-art methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"896-913"},"PeriodicalIF":18.6,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145083442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection
Pub Date : 2025-09-18  DOI: 10.1109/TPAMI.2025.3611519
Dingkang Liang;Wei Hua;Chunsheng Shi;Zhikang Zou;Xiaoqing Ye;Xiang Bai
Semi-supervised object detection (SSOD), which leverages unlabeled data to boost object detectors, has recently become a hot topic. However, existing SSOD approaches mainly focus on horizontal objects, leaving the oriented objects common in aerial images unexplored. At the same time, the annotation cost of oriented objects is significantly higher than that of their horizontal counterparts (approximately a 36.5% increase in cost). Therefore, in this paper, we propose a simple yet effective semi-supervised oriented object detection method termed SOOD++. Specifically, we observe that objects in aerial images usually have arbitrary orientations, small scales, and dense distributions, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; a Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pseudo-label/prediction pair by leveraging the intricate geometric information of aerial objects; and we treat aerial images as global layouts, explicitly building the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V2.0/DOTA-V1.5 benchmark, the proposed method outperforms the previous state-of-the-art (SOTA) by a large margin (+2.90/2.14, +2.16/2.18, and +2.66/2.32 mAP under 10%, 20%, and 30% labeled-data settings, respectively) with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline of 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, reaching 72.48 mAP and setting a new state of the art. Moreover, our method demonstrates stable generalization ability across different oriented detectors, even for multi-view oriented 3D object detectors.
{"title":"SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection","authors":"Dingkang Liang;Wei Hua;Chunsheng Shi;Zhikang Zou;Xiaoqing Ye;Xiang Bai","doi":"10.1109/TPAMI.2025.3611519","DOIUrl":"10.1109/TPAMI.2025.3611519","url":null,"abstract":"Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving oriented objects common in aerial images unexplored. At the same time, the annotation cost of oriented objects is significantly higher than that of their horizontal counterparts (an approximate 36.5% increase in costs). Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images usually have arbitrary orientations, small scales, and dense distribution, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V2.0/DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.90/2.14, +2.16/2.18, and +2.66/2.32) mAP under 10%, 20%, and 30% labeled data settings, respectively, with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. Moreover, our method demonstrates stable generalization ability across different oriented detectors, even for multi-view oriented 3D object detectors.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"840-858"},"PeriodicalIF":18.6,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145083517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}