The Value of Corrective Feedback in the Online Active Learning Paradigm
Pub Date: 2025-12-03 | DOI: 10.1109/TPAMI.2025.3639522
Mark Lindsey;Francis Kubala;Richard M. Stern
Online Active Learning (OAL) is a powerful tool for classifying evolving data streams using limited annotations from a human operator who is a domain expert. The objective of the OAL paradigm is to jointly minimize the classification error rate and the annotation cost across the data stream by posing periodic Active Learning (AL) queries. In this paper, this objective is extended to include the identification of classifier errors by the expert during the typical workflow. To this end, Corrective Feedback (CF) is introduced as a second channel of interaction between the expert and the learning algorithm, complementary to the AL channel, that allows the algorithm to obtain additional training labels without disrupting the expert’s workflow. Online Active Learning with Corrective Feedback (OAL-CF) is formally defined as a paradigm, and its efficacy is demonstrated through experimental application to two binary classification tasks, Spoken Language Verification and Voice-Type Discrimination. Finally, the effects of adding CF to the OAL paradigm are analyzed in terms of classification performance, annotation cost, trends over time, and class balance of the collected training data. Overall, the addition of CF yields a 53% relative reduction in cost compared to OAL without CF.
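To make the two interaction channels concrete, here is a minimal sketch of the loop the abstract describes: the learner poses an AL query whenever its confidence is low, and otherwise receives a corrective-feedback label whenever the simulated expert happens to notice a misclassification during normal work. The uncertainty threshold, the expert stand-in, and the notice probability are illustrative assumptions, not the authors' exact design.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)
model.partial_fit(rng.normal(size=(2, 5)), [0, 1], classes=[0, 1])  # warm start

def expert_label(x):                        # stand-in for the domain expert
    return int(x.sum() > 0)

al_queries = cf_labels = 0
for t in range(1000):                       # evolving stream, one item at a time
    x = rng.normal(size=(1, 5))
    prob = model.predict_proba(x)[0, 1]
    pred = int(prob > 0.5)
    truth = expert_label(x[0])
    if abs(prob - 0.5) < 0.1:               # AL channel: query uncertain samples
        model.partial_fit(x, [truth])
        al_queries += 1
    elif pred != truth and rng.random() < 0.3:  # CF channel: the expert happens
        model.partial_fit(x, [truth])           # to notice and correct the error
        cf_labels += 1
print(f"AL queries: {al_queries}, CF labels: {cf_labels}")
```

The CF branch trains on labels the expert produces anyway while reviewing outputs, which is why it adds training data without adding annotation cost.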
FedFask: Fast Sketching Distributed PCA for Large-Scale Federated Data
Pub Date: 2025-12-03 | DOI: 10.1109/TPAMI.2025.3639635
Xingcai Zhou;Guang Yang;Haotian Zheng;Linglong Kong;Jinde Cao
We study distributed principal component analysis (PCA) for large-scale federated data when the sample size $n$ and dimension $d$ are both ultra-large. Such data are now very common but pose numerous challenges for PCA learning, such as communication overhead and computational complexity. We develop a new algorithm $\mathsf{FedFask}$ (Fast Sketching for Federated learning) with lower communication cost $O(dr)$ and lower computational complexity $O(d(np/m+p^{2}+r^{2}))$, where $m$ is the number of workers, $r$ is the rank of the matrix, $p$ is the dimension of the sketched column space, and $r \leq p \ll d$. In $\mathsf{FedFask}$, we adopt and develop techniques such as fast sketching, alignment with orthogonal Procrustes fixing, and averaging on the matrix Stiefel manifold via a Kolmogorov-Nagumo-type average. As a result, $\mathsf{FedFask}$ achieves higher accuracy, lower stochastic variation, and the best representation of multiple randomly projected eigenspaces, while avoiding the orthogonal ambiguity of eigenspaces. We show that $\mathsf{FedFask}$ achieves the same learning rate $O\left(\frac{\kappa_{r} r}{\lambda_{r}}\sqrt{\frac{r^{*}}{n}}\right)$ as centralized PCA using all the data, and tolerates more workers for parallel acceleration of the computation. We conduct extensive experiments to demonstrate the effectiveness of $\mathsf{FedFask}$.
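A minimal numpy sketch of the recipe the abstract outlines: each worker compresses its local data with a random sketch and extracts a rank-$r$ basis, and the server applies an orthogonal Procrustes fix to remove each basis's rotation ambiguity before averaging and retracting back onto the Stiefel manifold. The randomized range finder and the SVD-based retraction below stand in for the paper's fast sketching and Kolmogorov-Nagumo-type average; all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m, r, p = 200, 3000, 5, 3, 10          # dim, samples, workers, rank, sketch size
X = rng.normal(size=(n, d))
X[:, :3] *= np.array([10.0, 8.0, 6.0])       # plant a dominant rank-3 subspace

def local_basis(Xk):
    G = rng.normal(size=(d, p))              # fast sketch: random test matrix
    Q, _ = np.linalg.qr(Xk.T @ (Xk @ G))     # d x p basis for the sketched range
    _, _, Vt = np.linalg.svd(Xk @ Q, full_matrices=False)
    return Q @ Vt[:r].T                      # d x r local eigenbasis estimate

bases = [local_basis(Xk) for Xk in np.array_split(X, m)]
ref, aligned = bases[0], []
for V in bases:                              # orthogonal Procrustes fixing: undo
    U, _, Wt = np.linalg.svd(V.T @ ref)      # each basis's rotation ambiguity
    aligned.append(V @ (U @ Wt))             # w.r.t. a common reference
Ub, _, Wb = np.linalg.svd(sum(aligned) / m, full_matrices=False)
V_fed = Ub @ Wb                              # retract the mean onto the Stiefel manifold

_, _, Vt = np.linalg.svd(X, full_matrices=False)
err = np.linalg.norm(V_fed @ V_fed.T - Vt[:r].T @ Vt[:r])
print("projector gap vs. centralized PCA:", err)   # small when the signal dominates
```

Each worker's upload is a single $d \times r$ matrix, which is where the $O(dr)$ communication cost comes from.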
{"title":"FedFask: Fast Sketching Distributed PCA for Large-Scale Federated Data","authors":"Xingcai Zhou;Guang Yang;Haotian Zheng;Linglong Kong;Jinde Cao","doi":"10.1109/TPAMI.2025.3639635","DOIUrl":"10.1109/TPAMI.2025.3639635","url":null,"abstract":"We study distributed principal component analysis (PCA) for large-scale federated data when the sample size <inline-formula><tex-math>$n$</tex-math></inline-formula> and dimension <inline-formula><tex-math>$d$</tex-math></inline-formula> are both ultra-large. This type of data is currently very common, but faces numerous challenges in PCA learning, such as communication overhead and computational complexity. We develop a new algorithm <inline-formula><tex-math>${mathsf {FedFask}}$</tex-math></inline-formula> (<b>Fa</b>st <b>Sk</b>etching for <b>Fed</b>erated learning) with lower communication cost <inline-formula><tex-math>$O(dr)$</tex-math></inline-formula> and lower computational complexity <inline-formula><tex-math>$O(d(np/m+p^{2}+r^{2}))$</tex-math></inline-formula>, where <inline-formula><tex-math>$m$</tex-math></inline-formula> is the number of workers, <inline-formula><tex-math>$r$</tex-math></inline-formula> is the rank of matrix, <inline-formula><tex-math>$p$</tex-math></inline-formula> is the dimension of sketched column space, and <inline-formula><tex-math>$rleq pll d$</tex-math></inline-formula>. In <inline-formula><tex-math>${mathsf {FedFask}}$</tex-math></inline-formula>, we adopt and develop technologies such as fast sketching, alignments with orthogonal Procrustes Fixing, and matrix Stiefel manifold via Kolmogorov-Nagumo-type average. Thus, <inline-formula><tex-math>${mathsf {FedFask}}$</tex-math></inline-formula> has a higher accuracy, lower stochastic variation, and best representation of multiple randomly projected eigenspaces, and avoids the orthogonal ambiguity of eigenspaces. We show that <inline-formula><tex-math>${mathsf {FedFask}}$</tex-math></inline-formula> achieves the same rate of learning <inline-formula><tex-math>$Oleft(frac{kappa _{r}r}{lambda _{r}}sqrt{frac{r^{*}}{n}}right)$</tex-math></inline-formula> as the centralized PCA uses all data, and tolerates more workers to parallel acceleration computation. We conduct extensive experiments to demonstrate the effectiveness of <inline-formula><tex-math>${mathsf {FedFask}}$</tex-math></inline-formula>.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3714-3725"},"PeriodicalIF":18.6,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145664268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meta-Learning-Based Surrogate Models for Efficient Hyperparameter Optimization
Pub Date: 2025-11-06 | DOI: 10.1109/TPAMI.2025.3630178
Liping Deng;Maziar Raissi;MingQing Xiao
Sequential Model-Based Optimization (SMBO) is a highly effective strategy for hyperparameter search in machine learning. It utilizes a surrogate model that fits previous trials and approximates the hyperparameter response surface (performance). This surrogate model primarily guides the decision-making process for selecting the next set of hyperparameters. Existing classic surrogates, such as Gaussian processes and random forests, focus solely on the current task of interest and cannot incorporate trials from historical tasks. This limitation hinders their efficacy in various applications. Inspired by the state-of-the-art convolutional neural process, this paper proposes a novel meta-learning-based surrogate model for efficient and effective hyperparameter optimization. Our surrogate is trained on the meta-knowledge from a range of historical tasks, enabling it to accurately predict the hyperparameter response surface even with a limited number of trials on a new task. We tested our approach on the hyperparameter selection problem for the well-known support vector machine (SVM), residual neural network (ResNet), and vision transformer (ViT) across hundreds of real-world classification datasets. The empirical results demonstrate its superiority over existing surrogate models, highlighting the effectiveness of meta-learning in hyperparameter optimization.
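To make the surrogate's role concrete, here is a generic SMBO loop on a toy one-dimensional response surface. The surrogate below is an off-the-shelf Gaussian process; the paper's contribution is to replace it with a neural-process surrogate meta-trained on historical tasks so that it predicts well from very few trials. The toy surface and the UCB acquisition rule are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def response(c):                               # toy validation-accuracy surface
    return float(np.exp(-(c - 0.5) ** 2 / 0.5))

candidates = np.linspace(-3, 3, 200)[:, None]  # hyperparameter search grid
trials_x = [rng.uniform(-3, 3)]                # one random initial trial
trials_y = [response(trials_x[0])]
for step in range(15):
    surrogate = GaussianProcessRegressor()     # the paper swaps in its
    surrogate.fit(np.array(trials_x)[:, None], trials_y)  # meta-trained surrogate here
    mu, sd = surrogate.predict(candidates, return_std=True)
    nxt = float(candidates[np.argmax(mu + sd), 0])  # UCB-style acquisition
    trials_x.append(nxt)
    trials_y.append(response(nxt))
print("best trial:", trials_x[int(np.argmax(trials_y))])
```

A meta-trained surrogate changes only the `fit`/`predict` stage: its prior already encodes how response surfaces behave across tasks, so `mu` and `sd` are informative even after one or two trials.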
{"title":"Meta-Learning-Based Surrogate Models for Efficient Hyperparameter Optimization","authors":"Liping Deng;Maziar Raissi;MingQing Xiao","doi":"10.1109/TPAMI.2025.3630178","DOIUrl":"10.1109/TPAMI.2025.3630178","url":null,"abstract":"Sequential Model-Based Optimization (SMBO) is a highly effective strategy for hyperparameter search in machine learning. It utilizes a surrogate model that fits previous trials and approximates the hyperparameter response surface (performance). This surrogate model primarily guides the decision-making process for selecting the next set of hyperparameters. Existing classic surrogates, such as Gaussian processes and random forests, focus solely on the current task of interest and cannot incorporate trials from historical tasks. This limitation hinders their efficacy in various applications. Inspired by the state-of-the-art convolutional neural process, this paper proposes a novel meta-learning-based surrogate model for efficient and effective hyperparameter optimization. Our surrogate is trained on the meta-knowledge from a range of historical tasks, enabling it to accurately predict the hyperparameter response surface even with a limited number of trials on a new task. We tested our approach on the hyperparameter selection problem for the well-known support vector machine (SVM), residual neural network (ResNet), and vision transformer (ViT) across hundreds of real-world classification datasets. The empirical results demonstrate its superiority over existing surrogate models, highlighting the effectiveness of meta-learning in hyperparameter optimization.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3931-3938"},"PeriodicalIF":18.6,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145454731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Affine Correspondences Between Multi-Camera Systems for Relative Pose Estimation
Pub Date: 2025-10-27 | DOI: 10.1109/TPAMI.2025.3626134
Banglei Guan;Ji Zhao
We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to multi-camera relative pose estimation are either restricted to special cases of motion, have excessively high computational complexity, or require too many point correspondences (PCs). These solvers therefore impede efficient or accurate relative pose estimation when RANSAC is applied as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs admits a feasible minimal solution when the geometric constraints between ACs and multi-camera systems are exploited through a special parameterization. We present a problem formulation based on two ACs that encompasses the two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem—where the scale transformation of multi-camera systems is unknown—we propose 7DOF solvers that compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems. These include 5DOF solvers with a known relative rotation angle and a 4DOF solver with a known vertical direction. Experiments on both virtual and real multi-camera systems show that the proposed solvers are more efficient than state-of-the-art algorithms while achieving better relative pose accuracy.
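The constraint budget behind these minimal solvers can be made explicit. Written here for the calibrated two-view pinhole case with essential matrix $E$ (a standard form from the AC literature, not the paper's generalized-camera parameterization), differentiating the epipolar constraint along the local affinity yields two equations on top of the point constraint, so each AC supplies three:

```latex
% Constraint counting per affine correspondence (x_1, x_2, A):
\begin{align}
  \mathbf{x}_2^{\top} E \,\mathbf{x}_1 &= 0
    && \text{(point part: 1 equation)} \\
  A^{\top} \big(E\,\mathbf{x}_1\big)_{[1:2]}
    + \big(E^{\top}\mathbf{x}_2\big)_{[1:2]} &= \mathbf{0}
    && \text{(affine part: 2 equations)}
\end{align}
% Two ACs therefore provide the six constraints needed for the 6DOF
% generalized relative pose, and three ACs cover the 7DOF case in which
% the scale is also unknown.
```

The affine part follows from requiring the epipolar constraint to hold to first order as the matched point moves under the local affinity $A$.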
{"title":"Affine Correspondences Between Multi-Camera Systems for Relative Pose Estimation","authors":"Banglei Guan;Ji Zhao","doi":"10.1109/TPAMI.2025.3626134","DOIUrl":"10.1109/TPAMI.2025.3626134","url":null,"abstract":"We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to the multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede an efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution, when exploiting the geometric constraints between ACs and multi-camera systems using a special parameterization. We present a problem formulation based on two ACs that encompass two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem—where the scale transformation of multi-camera systems is unknown—we propose 7DOF solvers to compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems. These include 5DOF solvers with known relative rotation angle, and 4DOF solver with known vertical direction. Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than the state-of-the-art algorithms, while resulting in a better relative pose accuracy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2012-2029"},"PeriodicalIF":18.6,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-Resolution Open-Vocabulary Object 6D Pose Estimation
Jaime Corsetti;Davide Boscaini;Francesco Giuliari;Changjae Oh;Andrea Cavallaro;Fabio Poiesi
Pub Date: 2025-10-23 | DOI: 10.1109/TPAMI.2025.3624589
The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.
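The final stage of such a pipeline is standard rigid registration: once cross-scene matches are available, the relative pose follows from a Kabsch-style alignment. A minimal sketch on synthetic matches (the feature-extraction and matching stages, and all names here, are assumptions, not Horyon's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))                       # matched 3D points, scene 1
theta = 0.4
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
t_true = np.array([0.2, -0.1, 0.5])
Q = P @ R_true.T + t_true                          # same points in scene 2

Pc, Qc = P - P.mean(0), Q - Q.mean(0)              # center both point sets
U, _, Vt = np.linalg.svd(Pc.T @ Qc)                # Kabsch: SVD of the covariance
D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U))])  # guard against reflections
R = Vt.T @ D @ U.T                                 # recovered rotation
t = Q.mean(0) - R @ P.mean(0)                      # recovered translation
print(np.allclose(R, R_true), np.allclose(t, t_true))
```

In practice the matches are noisy and contaminated by outliers, so this closed-form step would run inside a robust loop such as RANSAC.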
{"title":"High-Resolution Open-Vocabulary Object 6D Pose Estimation","authors":"Jaime Corsetti;Davide Boscaini;Francesco Giuliari;Changjae Oh;Andrea Cavallaro;Fabio Poiesi","doi":"10.1109/TPAMI.2025.3624589","DOIUrl":"10.1109/TPAMI.2025.3624589","url":null,"abstract":"The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on <italic>all</i> datasets, outperforming by 12.6 in Average Recall the previous best-performing approach.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2066-2077"},"PeriodicalIF":18.6,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145357313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SMC++: Masked Learning of Unsupervised Video Semantic Compression
Yuan Tian;Xiaoyue Ling;Cong Geng;Qiang Hu;Guo Lu;Guangtao Zhai
Pub Date: 2025-10-23 | DOI: 10.1109/TPAMI.2025.3625063
Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that specifically preserves video semantics by jointly mining and compressing them in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information such as trivial textural details, wasting bit cost and introducing semantic noise. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC into an advanced SMC++ model in several respects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning. Second, we introduce a Transformer-based compression module to improve semantic compression efficacy. Since directly mining the complex redundancy among heterogeneous features from different coding stages is non-trivial, we introduce a compact blueprint semantic representation that aligns these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate that the proposed SMC and SMC++ models are markedly superior to previous traditional, learnable, and perceptual-quality-oriented video codecs on three video analysis tasks and seven datasets.
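As a rough illustration of how the three objectives named above could combine, the sketch below uses crude proxies: a masked-token prediction term, a unit-Gaussian rate proxy for the compressed latent, and a token-entropy penalty on unmasked positions standing in for the non-semantic entropy regularizer. These proxies and all names are assumptions for exposition, not the SMC++ training code.

```python
import torch
import torch.nn.functional as F

def smcpp_style_loss(logits, target_ids, mask, z, lam_rate=0.01, lam_ns=0.1):
    # (1) MVM term: predict masked tokens, mining generalizable semantics
    mvm = F.cross_entropy(logits[mask], target_ids[mask])
    # (2) rate proxy for the compressed latent z (unit-Gaussian entropy model)
    rate = 0.5 * (z ** 2).mean()
    # (3) non-semantic entropy proxy: penalize high-entropy token predictions
    #     at unmasked positions so bits are not spent on trivial texture
    p = F.softmax(logits[~mask], dim=-1)
    ns = -(p * p.clamp_min(1e-9).log()).sum(-1).mean()
    return mvm + lam_rate * rate + lam_ns * ns

# toy shapes: batch of 2 clips, 16 tokens each, vocabulary of 512
logits = torch.randn(2, 16, 512, requires_grad=True)
targets = torch.randint(0, 512, (2, 16))
mask = torch.rand(2, 16) < 0.6                 # ~60% of tokens masked
z = torch.randn(2, 64)
smcpp_style_loss(logits, targets, mask, z).backward()
```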
{"title":"SMC++: Masked Learning of Unsupervised Video Semantic Compression","authors":"Yuan Tian;Xiaoyue Ling;Cong Geng;Qiang Hu;Guo Lu;Guangtao Zhai","doi":"10.1109/TPAMI.2025.3625063","DOIUrl":"10.1109/TPAMI.2025.3625063","url":null,"abstract":"Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during the compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information like trivial textural details, wasting bitcost and bringing semantic noises. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC as an advanced SMC++ model from several aspects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning ability. Second, we introduce a Transformer-based compression module, to improve the semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs, on three video analysis tasks and seven datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1992-2011"},"PeriodicalIF":18.6,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145357314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Orthogonal Decoupling Contrastive Regularization: Toward Uncorrelated Feature Decoupling for Unpaired Image Restoration
Zhongze Wang;Jingchao Peng;Haitao Zhao;Lujian Yao;Kaijie Zhao
Pub Date: 2025-10-20 | DOI: 10.1109/TPAMI.2025.3620803
Unpaired image restoration (UIR) is a significant task due to the difficulty of acquiring paired degraded/clear images with identical backgrounds. In this paper, we propose a novel UIR method based on the assumption that an image contains both degradation-related features, which affect the level of degradation, and degradation-unrelated features, such as texture and semantic information. Our method aims to ensure that the degradation-related features of the restoration result closely resemble those of the clear image, while the degradation-unrelated features align with the input degraded image. Specifically, we introduce a Feature Orthogonalization Module, optimized on the Stiefel manifold, to decouple image features and ensure feature uncorrelation. A task-driven Depth-wise Feature Classifier is proposed to assign weights to the uncorrelated features based on their relevance to degradation prediction. To keep the training process from depending on the quality of the clear image in any single input pair, we maintain several degradation-related proxies that describe the degradation level of clear images, enhancing the model’s robustness. Finally, a weighted PatchNCE loss is introduced to pull degradation-related features in the output image toward those of clear images, while bringing degradation-unrelated features close to those of the degraded input.
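A sketch of the two ingredients named above: (a) decorrelating a learned projection by retracting it onto the Stiefel manifold, with a simple QR retraction standing in for the paper's manifold-optimized module, and (b) a weighted InfoNCE/PatchNCE-style loss whose per-channel weights would come from the degradation classifier. All names and the weighting scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def stiefel_retract(W):
    # project a learned transform back onto the Stiefel manifold via QR
    Q, _ = torch.linalg.qr(W)
    return Q

def weighted_patchnce(f_out, f_pos, f_negs, w, tau=0.07):
    # f_out, f_pos: (N, C) output/clear patch features; f_negs: (N, K, C);
    # w: (C,) channel weights from the depth-wise degradation classifier
    fo = F.normalize(f_out * w, dim=-1)
    fp = F.normalize(f_pos * w, dim=-1)
    fn = F.normalize(f_negs * w, dim=-1)
    l_pos = (fo * fp).sum(-1, keepdim=True)        # (N, 1) positive similarity
    l_neg = torch.einsum("nc,nkc->nk", fo, fn)     # (N, K) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    return F.cross_entropy(logits, torch.zeros(len(fo), dtype=torch.long))

N, K, C = 8, 32, 64
w = torch.rand(C)                                  # relevance weights in [0, 1]
loss = weighted_patchnce(torch.randn(N, C), torch.randn(N, C),
                         torch.randn(N, K, C), w)
```

Up-weighting degradation-relevant channels pulls those features toward the clear image, while down-weighted (degradation-unrelated) channels contribute little to the contrast, matching the stated pull/keep behavior.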
{"title":"Orthogonal Decoupling Contrastive Regularization: Toward Uncorrelated Feature Decoupling for Unpaired Image Restoration","authors":"Zhongze Wang;Jingchao Peng;Haitao Zhao;Lujian Yao;Kaijie Zhao","doi":"10.1109/TPAMI.2025.3620803","DOIUrl":"10.1109/TPAMI.2025.3620803","url":null,"abstract":"Unpaired image restoration (UIR) is a significant task due to the difficulty of acquiring paired degraded/clear images with identical backgrounds. In this paper, we propose a novel UIR method based on the assumption that an image contains both degradation-related features, which affect the level of degradation, and degradation-unrelated features, such as texture and semantic information. Our method aims to ensure that the degradation-related features of the restoration result closely resemble those of the clear image, while the degradation-unrelated features align with the input degraded image. Specifically, we introduce a Feature Orthogonalization Module optimized on Stiefel manifold to decouple image features, ensuring feature uncorrelation. A task-driven Depth-wise Feature Classifier is proposed to assign weights to uncorrelated features based on their relevance to degradation prediction. To avoid the dependence of the training process on the quality of the clear image in a single pair of input data, we propose to maintain several degradation-related proxies describing the degradation level of clear images to enhance the model’s robustness. Finally, a weighted PatchNCE loss is introduced to pull degradation-related features in the output image toward those of clear images, while bringing degradation-unrelated features close to those of the degraded input.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1842-1859"},"PeriodicalIF":18.6,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145331646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RealCustom++: Representing Images as Real Textual Word for Real-Time Customization
Zhendong Mao;Mengqi Huang;Fei Ding;Mingcong Liu;Qian He;Yongdong Zhang
Pub Date: 2025-10-17 | DOI: 10.1109/TPAMI.2025.3623025
Text-to-image customization aims to generate images that align with both the given text and the subject in the given image. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word inherently conflicts and entangles with other real words, resulting in a dual-optimum paradox between subject similarity and text controllability. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to generate a coherent guidance image and corresponding subject mask, thereby disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real text words; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask, and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. Extensive experiments validate RealCustom++’s superior performance, improving controllability by 7.48%, similarity by 3.04%, and quality by 76.43% simultaneously. Moreover, RealCustom++ further improves controllability by 4.6% and multi-subject similarity by 6.34% for multi-subject customization.
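A schematic of the masked injection step in the dual-branch inference: the Guidance Branch supplies a subject mask derived from the chosen real word's cross-attention, and subject features are blended in only where that mask is active, leaving the remainder under pure text control. The tensor layout, threshold, and convex blend are assumptions for illustration, not the released architecture.

```python
import torch

def masked_subject_injection(text_feat, subj_feat, attn_word, thresh=0.35):
    # attn_word: (H, W) cross-attention map of the chosen real word;
    # text_feat, subj_feat: (H, W, C) features from the two branches
    M = (attn_word / attn_word.max()).clamp(0, 1)
    M = (M > thresh).float().unsqueeze(-1)       # binarized subject mask
    return M * subj_feat + (1 - M) * text_feat   # customize only inside the mask

H, W, C = 32, 32, 8
out = masked_subject_injection(torch.rand(H, W, C), torch.rand(H, W, C),
                               torch.rand(H, W))
```

Restricting the subject's influence to the masked region is what disentangles the two optimization targets: similarity is pursued inside the mask, controllability outside it.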
{"title":"RealCustom++: Representing Images as Real Textual Word for Real-Time Customization","authors":"Zhendong Mao;Mengqi Huang;Fei Ding;Mingcong Liu;Qian He;Yongdong Zhang","doi":"10.1109/TPAMI.2025.3623025","DOIUrl":"10.1109/TPAMI.2025.3623025","url":null,"abstract":"Text-to-image customization aims to generate images that align with both the given text and the subject in the given image. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word inherently conflicts and entangles with other real words, resulting in a dual-optimum paradox between the subject similarity and text controllability. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to generate a coherent guidance image and corresponding subject mask, there by disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real text words; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask, and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. Extensive experiments validate RealCustom++s superior performance, which improves controllability by 7.48%, similarity by 3.04% and quality by 76.43% simultaneously. Moreover, RealCustom++ further improves controllability by 4.6% and multi-subject similarity by 6.34% for multisubject customization","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2078-2095"},"PeriodicalIF":18.6,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145310795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vicinal Gaussian Transform: Rethinking Source-Free Domain Adaptation Through Source-Informed Label Consistency
Pub Date: 2025-10-15 | DOI: 10.1109/TPAMI.2025.3621631
Jing Wang;Yongchao Xu;Jing Tang;Zeyu Gong;Bo Tao;Clarence W. de Silva;Xiang Bai
A central challenge in source-free domain adaptation (SFDA) is the lack of a theoretical framework for explicitly analyzing domain shifts, as the absence of source data prevents direct domain comparisons. In this paper, we introduce the Vicinal Gaussian Transform (VGT), an analytical operator that models source-informed latent vicinities as Gaussians and shows that vicinal prediction divergence is bounded by their covariance. By this formulation, SFDA can be reframed as shrinking covariance to reinforce label consistency. To operationalize this idea, we introduce the Energy-based VGT (EBVGT), a novel SDE that realizes the Gaussian transform by contracting covariance through a denoising mechanism. A recovery-likelihood with a Schrödinger-Bridge smoothness penalty denoises perturbed states, while a BYOL-derived energy function, directly obtained from model predictions, provides the score to guide label-consistent trajectories within the vicinity. This design not only yields noise-suppressed vicinal features for adaptation without source data, but also eliminates the need for additional learnable parameters for score estimation, in contrast to conventional deep SDEs. Our EBVGT is model- and modality-agnostic, efficient for classification, and improves state-of-the-art SFDA methods by 1.3–3.0% (2.0% on average) across both 2D image and 3D point cloud benchmarks.
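The covariance bound the abstract refers to can be written schematically as follows (notation assumed, not the paper's exact statement): model the vicinity of a target feature $z$ as a Gaussian and let the classifier head $f$ be $L$-Lipschitz; then the expected prediction divergence inside the vicinity is controlled by the covariance trace.

```latex
\begin{equation}
  \mathbb{E}_{\tilde{z} \sim \mathcal{N}(\mu_z, \Sigma_z)}
  \big\| f(\tilde{z}) - f(\mu_z) \big\|^{2}
  \;\le\; L^{2}\, \mathbb{E}\big\| \tilde{z} - \mu_z \big\|^{2}
  \;=\; L^{2}\, \operatorname{tr}(\Sigma_z),
\end{equation}
% so shrinking Sigma_z (the Gaussian transform realized by the denoising SDE)
% directly tightens label consistency within the vicinity.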
{"title":"Vicinal Gaussian Transform: Rethinking Source-Free Domain Adaptation Through Source-Informed Label Consistency","authors":"Jing Wang;Yongchao Xu;Jing Tang;Zeyu Gong;Bo Tao;Clarence W. de Silva;Xiang Bai","doi":"10.1109/TPAMI.2025.3621631","DOIUrl":"10.1109/TPAMI.2025.3621631","url":null,"abstract":"A central challenge in source-free domain adaptation (SFDA) is the lack of a theoretical framework for explicitly analyzing domain shifts, as the absence of source data prevents direct domain comparisons. In this paper, we introduce the Vicinal Gaussian Transform (VGT), an analytical operator that models source-informed latent vicinities as Gaussians and shows that vicinal prediction divergence is bounded by their covariance. By this formulation, SFDA can be reframed as shrinking covariance to reinforce label consistency. To operationalize this idea, we introduce the Energy-based VGT (EBVGT), a novel SDE that realizes the Gaussian transform by contracting covariance through a denoising mechanism. A recovery-likelihood with a Schrödinger-Bridge smoothness penalty denoises perturbed states, while a BYOL-derived energy function, directly obtained from model predictions, provides the score to guide label-consistent trajectories within the vicinity. This design not only yields noise-suppressed vicinal features for adaptation without source data, but also eliminates the need for additional learnable parameters for score estimation, in contrast to conventional deep SDEs. Our EBVGT is model- and modality-agnostic, efficient for classification, and improves state-of-the-art SFDA methods by 1.3–3.0% (2.0% on average) across both 2D image and 3D point cloud benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2030-2047"},"PeriodicalIF":18.6,"publicationDate":"2025-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145295630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Unnormalized Statistical Models Through Compositional Optimization
Wei Jiang;Jiayu Qin;Lingyu Wu;Changyou Chen;Tianbao Yang;Lijun Zhang
Pub Date: 2025-10-14 | DOI: 10.1109/TPAMI.2025.3621320
Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation (NCE) has been proposed, formulating the objective as the logistic loss between the real data and artificial noise. However, previous research indicates that NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models through the lens of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be expressed as a compositional function whose inner function can be estimated using stochastic samples. Consequently, the objective can be optimized via stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results in Gaussian mean estimation by showing our method has a much more favorable loss landscape and enjoys faster convergence; (3) demonstrating better performance on various applications, including density estimation, out-of-distribution detection, and real image generation.
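A minimal sketch of the compositional idea on the paper's motivating Gaussian mean estimation setting, with $f_{\theta}(x) = -(x-\theta)^{2}/2$. The key step is that $\log Z(\theta) = \log \mathbb{E}_{x \sim q}[e^{f_{\theta}(x)}/q(x)]$ composes $\log(\cdot)$ with an inner expectation, so an SCGD-style moving average $u$ tracks the inner expectation instead of using a biased log-of-sample-mean. The noise distribution $q$, step sizes, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, size=5000)        # true mean = 2

theta, u = 0.0, 1.0                          # parameter and inner-function tracker
for t in range(1, 5001):
    lr, beta = 0.05 / np.sqrt(t), 0.1
    x = data[rng.integers(len(data))]        # data sample
    z = rng.normal(scale=3.0)                # noise sample from q = N(0, 9)
    q = np.exp(-z ** 2 / 18) / np.sqrt(18 * np.pi)
    w = np.exp(-(z - theta) ** 2 / 2) / q    # importance weight exp(f(z)) / q(z)
    u = (1 - beta) * u + beta * w            # moving average of E_q[w] = Z(theta)
    grad_inner = w * (z - theta)             # d/dtheta of exp(f(z)) / q(z)
    grad = -(x - theta) + grad_inner / u     # grad of -f(x) + log Z(theta)
    theta -= lr * grad
print(f"estimated mean: {theta:.2f}")        # close to 2.0
```

The variance of the importance weight $w$ depends on how well $q$ covers the model, which is the dependence on the noise distribution that the convergence analysis quantifies.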
{"title":"Optimizing Unnormalized Statistical Models Through Compositional Optimization","authors":"Wei Jiang;Jiayu Qin;Lingyu Wu;Changyou Chen;Tianbao Yang;Lijun Zhang","doi":"10.1109/TPAMI.2025.3621320","DOIUrl":"10.1109/TPAMI.2025.3621320","url":null,"abstract":"Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss between the real data and the artificial noise. However, previous research indicates that NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models through the lens of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be expressed as a compositional function whose inner function can be estimated using stochastic samples. Consequently, the objective can be optimized via stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results in Gaussian mean estimation by showing our method has a much favorable loss landscape and enjoys faster convergence; (3) demonstrating better performance on various applications, including density estimation, out-of-distribution detection, and real image generation.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1949-1960"},"PeriodicalIF":18.6,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145289305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}