Pub Date: 2025-11-06 · DOI: 10.1109/TPAMI.2025.3630178
Liping Deng;Maziar Raissi;MingQing Xiao
Sequential Model-Based Optimization (SMBO) is a highly effective strategy for hyperparameter search in machine learning. It utilizes a surrogate model that fits previous trials and approximates the hyperparameter response surface (performance). This surrogate model primarily guides the decision-making process for selecting the next set of hyperparameters. Existing classic surrogates, such as Gaussian processes and random forests, focus solely on the current task of interest and cannot incorporate trials from historical tasks. This limitation hinders their efficacy in various applications. Inspired by the state-of-the-art convolutional neural process, this paper proposes a novel meta-learning-based surrogate model for efficient and effective hyperparameter optimization. Our surrogate is trained on the meta-knowledge from a range of historical tasks, enabling it to accurately predict the hyperparameter response surface even with a limited number of trials on a new task. We tested our approach on the hyperparameter selection problem for the well-known support vector machine (SVM), residual neural network (ResNet), and vision transformer (ViT) across hundreds of real-world classification datasets. The empirical results demonstrate its superiority over existing surrogate models, highlighting the effectiveness of meta-learning in hyperparameter optimization.
{"title":"Meta-Learning-Based Surrogate Models for Efficient Hyperparameter Optimization","authors":"Liping Deng;Maziar Raissi;MingQing Xiao","doi":"10.1109/TPAMI.2025.3630178","DOIUrl":"10.1109/TPAMI.2025.3630178","url":null,"abstract":"Sequential Model-Based Optimization (SMBO) is a highly effective strategy for hyperparameter search in machine learning. It utilizes a surrogate model that fits previous trials and approximates the hyperparameter response surface (performance). This surrogate model primarily guides the decision-making process for selecting the next set of hyperparameters. Existing classic surrogates, such as Gaussian processes and random forests, focus solely on the current task of interest and cannot incorporate trials from historical tasks. This limitation hinders their efficacy in various applications. Inspired by the state-of-the-art convolutional neural process, this paper proposes a novel meta-learning-based surrogate model for efficient and effective hyperparameter optimization. Our surrogate is trained on the meta-knowledge from a range of historical tasks, enabling it to accurately predict the hyperparameter response surface even with a limited number of trials on a new task. We tested our approach on the hyperparameter selection problem for the well-known support vector machine (SVM), residual neural network (ResNet), and vision transformer (ViT) across hundreds of real-world classification datasets. 
The empirical results demonstrate its superiority over existing surrogate models, highlighting the effectiveness of meta-learning in hyperparameter optimization.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3931-3938"},"PeriodicalIF":18.6,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145454731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
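As background, the generic SMBO loop the abstract builds on (fit a surrogate to past trials, then choose the next hyperparameter by maximizing an acquisition score) can be sketched with a toy surrogate. Everything here is illustrative: the objective, the 1-nearest-neighbour surrogate, and the exploration bonus are hypothetical stand-ins, not the paper's meta-learned model.

```python
import random

def objective(log_lr):
    # hypothetical validation accuracy as a function of log10(learning rate);
    # peaks at log_lr = -3
    return 1.0 - (log_lr + 3.0) ** 2 / 10.0

def surrogate_predict(trials, x):
    # 1-nearest-neighbour surrogate: predict the score of the closest past
    # trial, returning the distance as a crude uncertainty proxy
    xs, ys = zip(*trials)
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
    return ys[i], abs(xs[i] - x)

def smbo(n_iters=20, seed=0):
    rng = random.Random(seed)
    trials = [(-6.0, objective(-6.0)), (0.0, objective(0.0))]  # initial trials
    for _ in range(n_iters):
        candidates = [rng.uniform(-6.0, 0.0) for _ in range(100)]

        def acq(x):
            # acquisition: predicted score plus a bonus for unexplored regions
            mu, dist = surrogate_predict(trials, x)
            return mu + 0.3 * dist

        x_next = max(candidates, key=acq)          # pick the next trial
        trials.append((x_next, objective(x_next)))  # evaluate and record it
    return max(trials, key=lambda t: t[1])          # best trial found

best_x, best_y = smbo()
```

The loop quickly concentrates trials near the optimum at log_lr = -3 because unexplored regions get an exploration bonus before the surrogate's predictions take over.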
Pub Date: 2025-10-27 · DOI: 10.1109/TPAMI.2025.3626134
Banglei Guan;Ji Zhao
We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede efficient or accurate relative pose estimation when RANSAC is applied as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution when the geometric constraints between ACs and multi-camera systems are exploited using a special parameterization. We present a problem formulation based on two ACs that encompasses the two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we develop a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem, where the scale transformation of multi-camera systems is unknown, we propose 7DOF solvers that compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems, including 5DOF solvers with a known relative rotation angle and a 4DOF solver with a known vertical direction. Experiments on both virtual and real multi-camera systems demonstrate that the proposed solvers are more efficient than state-of-the-art algorithms while achieving better relative pose accuracy.
{"title":"Affine Correspondences Between Multi-Camera Systems for Relative Pose Estimation","authors":"Banglei Guan;Ji Zhao","doi":"10.1109/TPAMI.2025.3626134","DOIUrl":"10.1109/TPAMI.2025.3626134","url":null,"abstract":"We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to the multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede an efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution, when exploiting the geometric constraints between ACs and multi-camera systems using a special parameterization. We present a problem formulation based on two ACs that encompass two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem—where the scale transformation of multi-camera systems is unknown—we propose 7DOF solvers to compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems. These include 5DOF solvers with known relative rotation angle, and 4DOF solver with known vertical direction. 
Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than the state-of-the-art algorithms, while resulting in a better relative pose accuracy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2012-2029"},"PeriodicalIF":18.6,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
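As background for the geometric constraints such solvers exploit, a single point correspondence between two calibrated views must satisfy the epipolar constraint x2ᵀ E x1 = 0 with essential matrix E = [t]ₓ R. The sketch below verifies this on synthetic data; the pose and points are made up for illustration, and the paper's setting (affine correspondences, generalized cameras) is richer than this single-camera toy.

```python
import math

def rot_z(theta):
    # rotation about the z-axis
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def skew(t):
    # cross-product matrix [t]_x
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

# hypothetical relative pose, with convention X2 = R X1 + t
R = rot_z(0.3)
t = [1.0, 0.2, 0.0]
E = matmul(skew(t), R)  # essential matrix E = [t]_x R

X1 = [0.5, -0.4, 4.0]                        # a 3D point in camera-1 coordinates
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]     # normalized image coordinates
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]

residual = sum(a * b for a, b in zip(x2, matvec(E, x1)))  # x2^T E x1, zero for a true match

x1_wrong = [0.9, 0.3, 1.0]  # a mismatched observation violates the constraint
residual_wrong = sum(a * b for a, b in zip(x2, matvec(E, x1_wrong)))
```

Minimal solvers assemble enough such constraints (here from ACs rather than single points) to pin down R and t with the fewest correspondences.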
Generalisation to unseen objects is very challenging in the 6D pose estimation task. While Vision-Language Models (VLMs) enable the use of natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object described only by a textual prompt. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features, which are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 points in Average Recall.
{"title":"High-Resolution Open-Vocabulary Object 6D Pose Estimation","authors":"Jaime Corsetti;Davide Boscaini;Francesco Giuliari;Changjae Oh;Andrea Cavallaro;Fabio Poiesi","doi":"10.1109/TPAMI.2025.3624589","DOIUrl":"10.1109/TPAMI.2025.3624589","url":null,"abstract":"The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on <italic>all</i> datasets, outperforming by 12.6 in Average Recall the previous best-performing approach.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2066-2077"},"PeriodicalIF":18.6,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145357313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that specifically preserves video semantics by jointly mining and compressing them in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information such as trivial textural details, wasting bits and introducing semantic noise. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC into an advanced SMC++ model in several respects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning. Second, we introduce a Transformer-based compression module to improve semantic compression efficacy. Since directly mining the complex redundancy among heterogeneous features from different coding stages is non-trivial, we introduce a compact blueprint semantic representation that aligns these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate that the proposed SMC and SMC++ models are markedly superior to previous traditional, learnable, and perceptual-quality-oriented video codecs on three video analysis tasks and seven datasets.
{"title":"SMC++: Masked Learning of Unsupervised Video Semantic Compression","authors":"Yuan Tian;Xiaoyue Ling;Cong Geng;Qiang Hu;Guo Lu;Guangtao Zhai","doi":"10.1109/TPAMI.2025.3625063","DOIUrl":"10.1109/TPAMI.2025.3625063","url":null,"abstract":"Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during the compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information like trivial textural details, wasting bitcost and bringing semantic noises. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC as an advanced SMC++ model from several aspects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning ability. Second, we introduce a Transformer-based compression module, to improve the semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. 
Extensive results demonstrate the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs, on three video analysis tasks and seven datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1992-2011"},"PeriodicalIF":18.6,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145357314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
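As a minimal illustration of the masked-prediction objective underlying MVM (not the paper's actual loss, which operates on learned tokens and motion), the reconstruction error is computed only over masked patches, so visible patches contribute nothing:

```python
def masked_prediction_loss(patches, predictions, mask):
    # squared error summed over masked patches only, averaged per masked patch;
    # visible patches carry no reconstruction signal, as in masked modelling
    masked = [(p, q) for p, q, m in zip(patches, predictions, mask) if m]
    return sum((a - b) ** 2 for p, q in masked for a, b in zip(p, q)) / max(len(masked), 1)

# toy 2-dimensional "patches": patches 1 and 2 are masked and must be predicted
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
preds   = [[0.0, 0.0], [3.0, 5.0], [5.0, 6.0]]
mask    = [False, True, True]
loss = masked_prediction_loss(patches, preds, mask)
```

Here patch 0 is visible, so its (deliberately bad) prediction is ignored; the loss reflects only the one-unit error on masked patch 1.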
Unpaired image restoration (UIR) is a significant task due to the difficulty of acquiring paired degraded/clear images with identical backgrounds. In this paper, we propose a novel UIR method based on the assumption that an image contains both degradation-related features, which affect the level of degradation, and degradation-unrelated features, such as texture and semantic information. Our method aims to ensure that the degradation-related features of the restoration result closely resemble those of a clear image, while the degradation-unrelated features align with the input degraded image. Specifically, we introduce a Feature Orthogonalization Module, optimized on the Stiefel manifold, to decouple image features and ensure feature uncorrelation. A task-driven Depth-wise Feature Classifier is proposed to assign weights to the uncorrelated features based on their relevance to degradation prediction. To keep training from depending on the quality of the clear image in any single pair of input data, we maintain several degradation-related proxies describing the degradation level of clear images, enhancing the model's robustness. Finally, a weighted PatchNCE loss is introduced to pull degradation-related features in the output image toward those of clear images, while bringing degradation-unrelated features close to those of the degraded input.
{"title":"Orthogonal Decoupling Contrastive Regularization: Toward Uncorrelated Feature Decoupling for Unpaired Image Restoration","authors":"Zhongze Wang;Jingchao Peng;Haitao Zhao;Lujian Yao;Kaijie Zhao","doi":"10.1109/TPAMI.2025.3620803","DOIUrl":"10.1109/TPAMI.2025.3620803","url":null,"abstract":"Unpaired image restoration (UIR) is a significant task due to the difficulty of acquiring paired degraded/clear images with identical backgrounds. In this paper, we propose a novel UIR method based on the assumption that an image contains both degradation-related features, which affect the level of degradation, and degradation-unrelated features, such as texture and semantic information. Our method aims to ensure that the degradation-related features of the restoration result closely resemble those of the clear image, while the degradation-unrelated features align with the input degraded image. Specifically, we introduce a Feature Orthogonalization Module optimized on Stiefel manifold to decouple image features, ensuring feature uncorrelation. A task-driven Depth-wise Feature Classifier is proposed to assign weights to uncorrelated features based on their relevance to degradation prediction. To avoid the dependence of the training process on the quality of the clear image in a single pair of input data, we propose to maintain several degradation-related proxies describing the degradation level of clear images to enhance the model’s robustness. 
Finally, a weighted PatchNCE loss is introduced to pull degradation-related features in the output image toward those of clear images, while bringing degradation-unrelated features close to those of the degraded input.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1842-1859"},"PeriodicalIF":18.6,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145331646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
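The weighted PatchNCE loss builds on the standard InfoNCE contrastive objective, which can be sketched directly. The feature vectors and temperature below are illustrative, and the paper's per-patch weighting is omitted:

```python
import math

def info_nce(query, positive, negatives, tau=0.07):
    # InfoNCE: -log( exp(q.p/tau) / (exp(q.p/tau) + sum_n exp(q.n/tau)) )
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    pos = math.exp(dot(query, positive) / tau)
    neg = sum(math.exp(dot(query, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

q = [1.0, 0.0]
# aligned positive -> near-zero loss; mismatched positive -> large loss
loss_aligned  = info_nce(q, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_mismatch = info_nce(q, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Minimizing this loss pulls the query toward its positive (here, the clear image's degradation-related features) and pushes it away from the negatives, which is exactly the pull/push behavior the abstract describes.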
Text-to-image customization aims to generate images that align with both the given text and the subject in the given image. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word inherently conflicts and entangles with other real words, resulting in a dual-optimum paradox between subject similarity and text controllability. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to generate a coherent guidance image and corresponding subject mask, thereby disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real text words; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. Extensive experiments validate RealCustom++'s superior performance, which improves controllability by 7.48%, similarity by 3.04%, and quality by 76.43% simultaneously. Moreover, RealCustom++ further improves controllability by 4.6% and multi-subject similarity by 6.34% for multi-subject customization.
{"title":"RealCustom++: Representing Images as Real Textual Word for Real-Time Customization","authors":"Zhendong Mao;Mengqi Huang;Fei Ding;Mingcong Liu;Qian He;Yongdong Zhang","doi":"10.1109/TPAMI.2025.3623025","DOIUrl":"10.1109/TPAMI.2025.3623025","url":null,"abstract":"Text-to-image customization aims to generate images that align with both the given text and the subject in the given image. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word inherently conflicts and entangles with other real words, resulting in a dual-optimum paradox between the subject similarity and text controllability. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to generate a coherent guidance image and corresponding subject mask, there by disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real text words; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask, and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. Extensive experiments validate RealCustom++s superior performance, which improves controllability by 7.48%, similarity by 3.04% and quality by 76.43% simultaneously. 
Moreover, RealCustom++ further improves controllability by 4.6% and multi-subject similarity by 6.34% for multisubject customization","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2078-2095"},"PeriodicalIF":18.6,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145310795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-15 · DOI: 10.1109/TPAMI.2025.3621631
Jing Wang;Yongchao Xu;Jing Tang;Zeyu Gong;Bo Tao;Clarence W. de Silva;Xiang Bai
A central challenge in source-free domain adaptation (SFDA) is the lack of a theoretical framework for explicitly analyzing domain shifts, as the absence of source data prevents direct domain comparisons. In this paper, we introduce the Vicinal Gaussian Transform (VGT), an analytical operator that models source-informed latent vicinities as Gaussians and shows that vicinal prediction divergence is bounded by their covariance. By this formulation, SFDA can be reframed as shrinking covariance to reinforce label consistency. To operationalize this idea, we introduce the Energy-based VGT (EBVGT), a novel SDE that realizes the Gaussian transform by contracting covariance through a denoising mechanism. A recovery-likelihood with a Schrödinger-Bridge smoothness penalty denoises perturbed states, while a BYOL-derived energy function, directly obtained from model predictions, provides the score to guide label-consistent trajectories within the vicinity. This design not only yields noise-suppressed vicinal features for adaptation without source data, but also eliminates the need for additional learnable parameters for score estimation, in contrast to conventional deep SDEs. Our EBVGT is model- and modality-agnostic, efficient for classification, and improves state-of-the-art SFDA methods by 1.3–3.0% (2.0% on average) across both 2D image and 3D point cloud benchmarks.
{"title":"Vicinal Gaussian Transform: Rethinking Source-Free Domain Adaptation Through Source-Informed Label Consistency","authors":"Jing Wang;Yongchao Xu;Jing Tang;Zeyu Gong;Bo Tao;Clarence W. de Silva;Xiang Bai","doi":"10.1109/TPAMI.2025.3621631","DOIUrl":"10.1109/TPAMI.2025.3621631","url":null,"abstract":"A central challenge in source-free domain adaptation (SFDA) is the lack of a theoretical framework for explicitly analyzing domain shifts, as the absence of source data prevents direct domain comparisons. In this paper, we introduce the Vicinal Gaussian Transform (VGT), an analytical operator that models source-informed latent vicinities as Gaussians and shows that vicinal prediction divergence is bounded by their covariance. By this formulation, SFDA can be reframed as shrinking covariance to reinforce label consistency. To operationalize this idea, we introduce the Energy-based VGT (EBVGT), a novel SDE that realizes the Gaussian transform by contracting covariance through a denoising mechanism. A recovery-likelihood with a Schrödinger-Bridge smoothness penalty denoises perturbed states, while a BYOL-derived energy function, directly obtained from model predictions, provides the score to guide label-consistent trajectories within the vicinity. This design not only yields noise-suppressed vicinal features for adaptation without source data, but also eliminates the need for additional learnable parameters for score estimation, in contrast to conventional deep SDEs. 
Our EBVGT is model- and modality-agnostic, efficient for classification, and improves state-of-the-art SFDA methods by 1.3–3.0% (2.0% on average) across both 2D image and 3D point cloud benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2030-2047"},"PeriodicalIF":18.6,"publicationDate":"2025-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145295630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
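The paper's bound, that vicinal prediction divergence is controlled by the vicinity's covariance, can be illustrated numerically: shrinking a Gaussian vicinity's covariance reduces the variance of a classifier's predictions over that vicinity. The classifier and numbers below are made-up stand-ins, not the paper's model:

```python
import math
import random

def predict(x):
    # a hypothetical smooth classifier logit; any nonlinear function works here
    return math.tanh(2.0 * x[0] - x[1])

def vicinal_prediction_variance(center, sigma, n=5000, seed=0):
    # sample an isotropic Gaussian vicinity around `center` and measure the
    # spread of the classifier's predictions over it
    rng = random.Random(seed)
    preds = [predict([c + rng.gauss(0.0, sigma) for c in center]) for _ in range(n)]
    mean = sum(preds) / n
    return sum((p - mean) ** 2 for p in preds) / n

center = [0.2, -0.1]
var_wide   = vicinal_prediction_variance(center, sigma=1.0)  # large covariance
var_shrunk = vicinal_prediction_variance(center, sigma=0.1)  # contracted covariance
```

Contracting the covariance (the role the denoising SDE plays in EBVGT) makes predictions within the vicinity more consistent, which is the label-consistency effect the formulation targets.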
Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation (NCE) has been proposed, formulating the objective as the logistic loss between the real data and artificial noise. However, previous research indicates that NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models through the lens of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be expressed as a compositional function whose inner function can be estimated using stochastic samples. Consequently, the objective can be optimized via stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate that it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) obtaining better results for Gaussian mean estimation by showing that our method has a much more favorable loss landscape and enjoys faster convergence; and (3) demonstrating better performance on various applications, including density estimation, out-of-distribution detection, and real image generation.
{"title":"Optimizing Unnormalized Statistical Models Through Compositional Optimization","authors":"Wei Jiang;Jiayu Qin;Lingyu Wu;Changyou Chen;Tianbao Yang;Lijun Zhang","doi":"10.1109/TPAMI.2025.3621320","DOIUrl":"10.1109/TPAMI.2025.3621320","url":null,"abstract":"Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss between the real data and the artificial noise. However, previous research indicates that NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models through the lens of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be expressed as a compositional function whose inner function can be estimated using stochastic samples. Consequently, the objective can be optimized via stochastic compositional optimization algorithms. 
Despite being a simple method, we demonstrate it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results in Gaussian mean estimation by showing our method has a much favorable loss landscape and enjoys faster convergence; (3) demonstrating better performance on various applications, including density estimation, out-of-distribution detection, and real image generation.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1949-1960"},"PeriodicalIF":18.6,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145289305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
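The compositional view of the log partition function can be made concrete: with a noise distribution q, log Z = log E_{x~q}[ p~(x)/q(x) ], where the inner expectation is estimated from samples of q. The sketch below (a Gaussian toy, not the paper's estimator) checks this against the analytic Z = sqrt(2*pi) of an unnormalized standard Gaussian:

```python
import math
import random

def log_partition_estimate(log_unnorm, q_sigma=2.0, n=20000, seed=0):
    # log Z = log E_{x~q}[ exp(log_unnorm(x)) / q(x) ]: a compositional function
    # whose inner expectation is estimated with samples from the noise q = N(0, q_sigma^2)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, q_sigma)
        log_q = -0.5 * (x / q_sigma) ** 2 - math.log(q_sigma * math.sqrt(2 * math.pi))
        total += math.exp(log_unnorm(x) - log_q)  # importance weight p~(x)/q(x)
    return math.log(total / n)

# unnormalized standard Gaussian exp(-x^2/2): true partition function is sqrt(2*pi)
est = log_partition_estimate(lambda x: -0.5 * x * x)
true_log_z = 0.5 * math.log(2 * math.pi)
```

The outer log of an inner expectation is exactly the compositional structure the abstract refers to; the variance of the importance weights (governed by the choice of q) controls how accurate the estimate is.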
Pub Date: 2025-10-14 · DOI: 10.1109/TPAMI.2025.3621250
Jie Wen;Yicheng Liu;Chao Huang;Chengliang Liu;Yong Xu;Xiaochun Cao
Fine-tuning pre-trained vision-language models (VLMs) has shown substantial benefits in a wide range of downstream tasks, often achieving impressive performance with minimal labeled data. Parameter-efficient fine-tuning techniques, in particular, have demonstrated their effectiveness in enhancing downstream task performance. However, these methods frequently struggle to generalize to out-of-distribution (OOD) data due to their reliance on non-causal representations, which can introduce biases and spurious correlations that negatively impact decision-making. Such spurious factors hinder the model’s generalization ability beyond the training distribution. To address these challenges, in this paper, we propose a novel causal intervention-based prompt tuning method to adapt VLMs to few-shot OOD generalization. Specifically, we leverage the front-door adjustment technique from causal inference to mitigate the effects of spurious correlations and enhance the model’s focus on causal relationships. Built upon VLMs, our approach begins by decoupling causal and non-causal representations in the vision-language alignment process. The causal representation that captures only essential semantically relevant information can serve as a mediator variable between the input image and output label, mitigating the biases from the latent confounder. To further enrich this causal representation, we propose a novel text-based diversity augmentation technique that uses textual features to provide additional semantic context. This augmentation technique can enhance the diversity of the causal representation, making it more robust and generalizable to various OOD scenarios. Experimental results across multiple OOD datasets demonstrate that our method significantly outperforms existing approaches, achieving state-of-the-art generalization performance.
Title: "Causal Interventional Prompt Tuning for Few-Shot Out-of-Distribution Generalization". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1978-1991.
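The front-door adjustment this abstract builds on can be illustrated numerically. The sketch below constructs a small discrete structural causal model with a hidden confounder U (mirroring the latent confounder in the abstract, with the mediator playing the role of the causal representation), applies the textbook front-door formula to purely observational probabilities, and checks that it recovers the true interventional distribution. The specific probabilities are made-up toy values, not anything from the paper.

```python
import itertools

# Toy SCM with hidden confounder U:  U -> X, U -> Y, and X -> M -> Y.
# M satisfies the front-door criterion relative to (X, Y).
p_u = {0: 0.5, 1: 0.5}
p_x_given_u = lambda x, u: (0.2 + 0.6 * u) if x == 1 else 1 - (0.2 + 0.6 * u)
p_m_given_x = lambda m, x: (0.1 + 0.7 * x) if m == 1 else 1 - (0.1 + 0.7 * x)
p_y_given_mu = lambda y, m, u: (0.1 + 0.4 * m + 0.4 * u) if y == 1 \
    else 1 - (0.1 + 0.4 * m + 0.4 * u)

# Exact observational joint P(u, x, m, y) by enumeration.
joint = {}
for u, x, m, y in itertools.product([0, 1], repeat=4):
    joint[(u, x, m, y)] = (p_u[u] * p_x_given_u(x, u)
                           * p_m_given_x(m, x) * p_y_given_mu(y, m, u))

def p(**fixed):
    # Marginal probability of the given variable assignment.
    return sum(pr for (u, x, m, y), pr in joint.items()
               if all({'u': u, 'x': x, 'm': m, 'y': y}[k] == v
                      for k, v in fixed.items()))

def front_door(x_val):
    # P(Y=1 | do(X=x)) = sum_m P(m|x) * sum_x' P(Y=1|m,x') P(x'),
    # using only observational quantities (U never appears).
    total = 0.0
    for m in (0, 1):
        pm = p(m=m, x=x_val) / p(x=x_val)
        inner = sum(p(y=1, m=m, x=xp) / p(m=m, x=xp) * p(x=xp)
                    for xp in (0, 1))
        total += pm * inner
    return total

def true_do(x_val):
    # Ground truth from the mutilated graph (X set by intervention).
    return sum(p_u[u] * p_m_given_x(m, x_val) * p_y_given_mu(1, m, u)
               for u in (0, 1) for m in (0, 1))

print(front_door(1), true_do(1))  # both equal 0.62
```

Because the mediator blocks every directed path from X to Y and the confounder never touches it directly, the front-door estimate matches the interventional ground truth exactly, which is the deconfounding property the proposed method exploits at the representation level.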
Pub Date: 2025-10-14, DOI: 10.1109/TPAMI.2025.3621650
Ngoc-Quan Ha-Phan;Myungsik Yoo
LiDAR perception for autonomous driving offers highly accurate scene depiction in three-dimensional (3D) space. Its most representative task is LiDAR panoptic segmentation (LPS), which unifies instance- and semantic-level segmentation in a holistic manner. Although previous approaches have achieved mature performance, no prior research has explored temporal information to enhance LPS. Because multi-frame processing can improve predictions through richer feature representation and recursive forecasting, as demonstrated in other LiDAR perception tasks, this study proposes an effective, temporally aware panoptic segmentation method for LiDAR point clouds. Specifically, we introduce two modules: a convolution-based cross-frame fusion attention (CFFA) module and an adjacent shifted feature encoder (ASFE) module. The CFFA module fuses multi-frame features using convolution-based attention, whereas the ASFE module leverages adjacent model outputs as an intermediate guide for the final segmentation predictions. Extensive experiments confirm the effectiveness of both modules for LPS. The proposed model achieves impressive panoptic-quality scores on popular benchmarks (63.36% on SemanticKITTI and 78.54% on Panoptic nuScenes), outperforming previous state-of-the-art methods by a significant margin. Further quantitative and qualitative analyses provide evidence of the advantages of multi-frame processing for LPS and demonstrate its behavior under different settings.
Title: "Exploiting the Benefits of Temporal Information in the Realm of LiDAR Panoptic Segmentation". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 2048-2065.
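To make the idea of attention-weighted multi-frame fusion concrete, here is a minimal NumPy sketch of one plausible form of convolution-style cross-frame attention: per-pixel frame scores from a channel projection (standing in for a learned 1x1 convolution), a softmax over the time axis, and a weighted sum. This illustrates the general mechanism only, not the authors' CFFA module; the shapes and the scoring weights `w` are assumptions.

```python
import numpy as np

def fuse_frames(feats, w):
    """Fuse T frames of features (T, C, H, W) into one (C, H, W) map.

    w: (C,) channel weights standing in for a learned 1x1 conv that
    scores how much each frame should contribute at each pixel.
    """
    scores = np.einsum('tchw,c->thw', feats, w)        # per-frame, per-pixel scores
    scores -= scores.max(axis=0, keepdims=True)        # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=0, keepdims=True)            # softmax over the T frames
    return np.einsum('tchw,thw->chw', feats, attn)     # attention-weighted sum

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 16, 16))   # T=4 frames, C=8 channels, 16x16 map
fused = fuse_frames(feats, rng.normal(size=8))
print(fused.shape)  # (8, 16, 16)
```

With zero scoring weights the softmax is uniform and the fusion reduces to a plain temporal average, which makes the role of the learned attention easy to see: it lets the model deviate from uniform averaging where some frames are more informative.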