Pub Date: 2025-08-22 | DOI: 10.1109/TPAMI.2025.3602282
Tehrim Yoon;Minyoung Hwang;Eunho Yang
Modern generative models, particularly denoising diffusion probabilistic models (DDPMs), produce high-quality synthetic images, enabling users to generate diverse and realistic images and videos. However, in many situations, such as healthcare and finance, edge devices or individual institutions possess locally collected data that is highly sensitive and whose privacy must be protected. Under such federated learning (FL) settings, various methods for training generative models have been studied, but most of them assume generative adversarial networks (GANs), and their algorithms are specific to GANs rather than to other generative models such as DDPMs. This paper proposes VQ-FedDiff, a new personalized algorithm for training DDPMs under federated learning settings that generates higher-quality images in terms of FID while keeping the risk of leaking sensitive information as low as that of locally trained, secure models. We demonstrate that VQ-FedDiff achieves state-of-the-art performance compared with existing federated learning methods for diffusion models in both IID and non-IID settings, on benchmark photorealistic and medical image datasets. Our results show that diffusion models can learn efficiently from decentralized, sensitive data, generating high-quality images while preserving data privacy.
{"title":"VQ-FedDiff: Federated Learning Algorithm of Diffusion Models With Client-Specific Vector-Quantized Conditioning","authors":"Tehrim Yoon;Minyoung Hwang;Eunho Yang","doi":"10.1109/TPAMI.2025.3602282","DOIUrl":"10.1109/TPAMI.2025.3602282","url":null,"abstract":"Modern generative models, particularly denoising diffusion probabilistic models (DDPMs), provide high-quality synthetic images, enabling users to generate diverse images and videos that are realistic. However, in a number of situations, edge devices or individual institutions may possess locally collected data that is highly sensitive and should ensure data privacy, such as in the field of healthcare and finance. Under such federated learning (FL) settings, various methods on training generative models have been studied, but most of them assume generative adversarial networks (GANs), and the algorithms are specific to GANs and not other forms of generative models such as DDPM. This paper proposes a new algorithm for training DDPMs under federated learning settings, VQ-FedDiff, which provides a personalized algorithm for training diffusion models that can generate higher-quality images FID while still keeping risk of breaching sensitive information as low as locally-trained secure models. We demonstrate that VQ-FedDiff shows state-of-the-art performance on existing federated learning of diffusion models in both IID and non-IID settings, and in benchmark photorealistic and medical image datasets. Our results show that diffusion models can efficiently learn with decentralized, sensitive data, generating high-quality images while preserving data privacy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11863-11873"},"PeriodicalIF":18.6,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144900424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-22 | DOI: 10.1109/TPAMI.2025.3597436
Xinyao Li;Jingjing Li;Zhekai Du;Lei Zhu;Heng Tao Shen
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, a phenomenon known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap transfers only modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, the different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of the different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric that categorizes samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while the uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our method achieves up to a 9% performance gain with 9 times the computational efficiency. Extensive experiments and analyses across various backbones, baselines, datasets, and adaptation settings demonstrate the efficacy of our design.
{"title":"Unified Modality Separation: A Vision-Language Framework for Unsupervised Domain Adaptation","authors":"Xinyao Li;Jingjing Li;Zhekai Du;Lei Zhu;Heng Tao Shen","doi":"10.1109/TPAMI.2025.3597436","DOIUrl":"10.1109/TPAMI.2025.3597436","url":null,"abstract":"Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, which is known as <italic>modality gap</i>. Our findings reveal that direct UDA with the presence of modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to 9% performance gain with 9 times of computational efficiencies. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 11","pages":"10604-10618"},"PeriodicalIF":18.6,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144900428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-21 | DOI: 10.1109/TPAMI.2025.3601430
Guangyang Zeng;Qingcheng Zeng;Xinghan Li;Biqiang Mu;Jiming Chen;Ling Shi;Junfeng Wu
Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental problem in computer vision. Existing works generally set out from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose an optimal two-step algorithm to solve it: in the first step, we estimate the variance of the measurement noise and devise a consistent estimator based on bias elimination; in the second step, we execute a one-step Gauss-Newton iteration on the manifold to refine the consistent estimator. We prove that the proposed estimator achieves the same asymptotic statistical properties as the ML estimator: the first is consistency, i.e., the estimator converges to the ground truth as the number of points increases; the second is asymptotic efficiency, i.e., the mean squared error of the estimator converges to the theoretical lower bound, the Cramér-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics give our estimator a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the number of points reaches the order of hundreds, our estimator outperforms state-of-the-art ones in terms of estimation accuracy and CPU time.
{"title":"Consistent and Optimal Solution to Camera Motion Estimation","authors":"Guangyang Zeng;Qingcheng Zeng;Xinghan Li;Biqiang Mu;Jiming Chen;Ling Shi;Junfeng Wu","doi":"10.1109/TPAMI.2025.3601430","DOIUrl":"10.1109/TPAMI.2025.3601430","url":null,"abstract":"Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental issue in the computer vision community. The existing works generally set out from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose an optimal two-step algorithm to solve it: In the first step, we estimate the variance of measurement noises and devise a consistent estimator based on bias elimination; In the second step, we execute a one-step Gauss-Newton iteration on manifold to refine the consistent estimator. We prove that the proposed estimator achieves the same asymptotic statistical properties as the ML estimator: The first is consistency, i.e., the estimator converges to the ground truth as the point number increases; The second is asymptotic efficiency, i.e., the mean squared error of the estimator converges to the theoretical lower bound — Cramer-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics endow our estimator with a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the point number reaches the order of hundreds, our estimator outperforms the state-of-the-art ones in terms of estimation accuracy and CPU time.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"12005-12020"},"PeriodicalIF":18.6,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-20 | DOI: 10.1109/TPAMI.2025.3601141
Zhongyi Zhang;Jie Zhang;Wenbo Zhou;Xinghui Zhou;Qing Guo;Weiming Zhang;Tianwei Zhang;Nenghai Yu
Face-swapping techniques have advanced rapidly with the evolution of deep learning, leading to widespread use and growing concerns about potential misuse, especially in cases of fraud. While many efforts have focused on detecting swapped face images or videos, these methods are insufficient for tracing the malicious users behind fraudulent activities. Intrusive watermark-based approaches also fail to trace unmarked identities, limiting their practical utility. To address these challenges, we introduce FaceTracer, the first non-intrusive framework specifically designed to trace the identity of the source person from swapped face images or videos. Specifically, FaceTracer leverages a disentanglement module that effectively suppresses identity information related to the target person while isolating the identity features of the source person. This allows us to extract robust identity information that can directly link the swapped face back to the original individual, aiding in uncovering the actors behind fraudulent activities. Extensive experiments demonstrate FaceTracer’s effectiveness across various face-swapping techniques, successfully identifying the source person in swapped content and enabling the tracing of malicious actors involved in fraudulent activities. Additionally, FaceTracer shows strong transferability to unseen face-swapping methods, including commercial applications, and robustness against transmission distortions and adaptive attacks.
{"title":"FaceTracer: Unveiling Source Identities From Swapped Face Images and Videos for Fraud Prevention","authors":"Zhongyi Zhang;Jie Zhang;Wenbo Zhou;Xinghui Zhou;Qing Guo;Weiming Zhang;Tianwei Zhang;Nenghai Yu","doi":"10.1109/TPAMI.2025.3601141","DOIUrl":"10.1109/TPAMI.2025.3601141","url":null,"abstract":"Face-swapping techniques have advanced rapidly with the evolution of deep learning, leading to widespread use and growing concerns about potential misuse, especially in cases of fraud. While many efforts have focused on detecting swapped face images or videos, these methods are insufficient for tracing the malicious users behind fraudulent activities. Intrusive watermark-based approaches also fail to trace unmarked identities, limiting their practical utility. To address these challenges, we introduce <sc>FaceTracer</small>, the first non-intrusive framework specifically designed to trace the identity of the source person from swapped face images or videos. Specifically, <sc>FaceTracer</small> leverages a disentanglement module that effectively suppresses identity information related to the target person while isolating the identity features of the source person. This allows us to extract robust identity information that can directly link the swapped face back to the original individual, aiding in uncovering the actors behind fraudulent activities. Extensive experiments demonstrate <sc>FaceTracer</small>’s effectiveness across various face-swapping techniques, successfully identifying the source person in swapped content and enabling the tracing of malicious actors involved in fraudulent activities. Additionally, <sc>FaceTracer</small> shows strong transferability to unseen face-swapping methods including commercial applications and robustness against transmission distortions and adaptive attacks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"12021-12037"},"PeriodicalIF":18.6,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-20 | DOI: 10.1109/TPAMI.2025.3600873
Oguzhan Yigit;Richard C. Wilson
The Laplace-Beltrami operator (LBO) has established itself in the field of non-rigid shape analysis due to its many useful properties, such as being invariant under isometric transformations, having a countable eigensystem that forms an orthonormal basis, and fully characterizing geodesic distances on the manifold. However, this invariance only holds under isometric deformations, which leads to a performance breakdown in many real-world applications. In recent years, emphasis has been placed on extracting optimal features using deep learning methods; however, spectral signatures still play a crucial role and add value. In this paper, we take a step back, revisiting the LBO and proposing a supervised way to learn several operators on a manifold. Depending on the task, applying these operators allows the LBO eigenbasis to be trained to be more task-specific. The optimization of the LBO leads to large improvements in established descriptors such as the heat kernel signature across various tasks, including retrieval, classification, segmentation, and correspondence, demonstrating that the LBO eigenbasis can adapt to both global and highly local learning settings.
{"title":"LBONet: Supervised Spectral Descriptors for Shape Analysis","authors":"Oguzhan Yigit;Richard C. Wilson","doi":"10.1109/TPAMI.2025.3600873","DOIUrl":"10.1109/TPAMI.2025.3600873","url":null,"abstract":"The Laplace-Beltrami operator has established itself in the field of non-rigid shape analysis due to its many useful properties such as being invariant under isometric transformation, having a countable eigensystem forming an orthonormal basis, and fully characterizing geodesic distances of the manifold. However, this invariancy only applies under isometric deformations, which leads to a performance breakdown in many real-world applications. In recent years emphasis has been placed upon extracting optimal features using deep learning methods, however spectral signatures play a crucial role and still add value. In this paper we take a step back, revisiting the LBO and proposing a supervised way to learn several operators on a manifold. Depending on the task, by applying these functions, we can train the LBO eigenbasis to be more task-specific. The optimization of the LBO leads to enormous improvements to established descriptors such as the heat kernel signature in various tasks such as retrieval, classification, segmentation, and correspondence, proving the adaptation of the LBO eigenbasis to both global and highly local learning settings.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"12054-12068"},"PeriodicalIF":18.6,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-19 | DOI: 10.1109/TPAMI.2025.3599949
Zhiming Chang;Boyang Liu;Yifei Xia;Youming Guo;Boxin Shi;He Sun
Monitoring space objects is crucial for space situational awareness, yet reconstructing 3D satellite models from ground-based telescope images is highly challenging due to atmospheric turbulence, long observation distances, limited viewpoints, and low signal-to-noise ratios. In this paper, we propose a novel computational imaging framework that overcomes these obstacles by integrating a hybrid image pre-processing pipeline with a joint pose estimation and 3D reconstruction module based on controlled Gaussian Splatting (GS) and Branch-and-Bound (BnB) search. We validate our approach on both synthetic satellite datasets and on-sky observations of China's Tiangong Space Station and the International Space Station, achieving robust 3D reconstructions of low-Earth-orbit satellites from ground-based data. Quantitative evaluations using SSIM, PSNR, LPIPS, and Chamfer Distance demonstrate that our method outperforms state-of-the-art NeRF-based approaches, and ablation studies confirm the critical role of each component. Our framework enables high-fidelity 3D satellite monitoring from Earth, offering a cost-effective alternative for space situational awareness. Project page: https://ai4scientificimaging.org/3DSatellites.
{"title":"Reconstructing Satellites in 3D from Amateur Telescope Images.","authors":"Zhiming Chang, Boyang Liu, Yifei Xia, Youming Guo, Boxin Shi, He Sun","doi":"10.1109/TPAMI.2025.3599949","DOIUrl":"10.1109/TPAMI.2025.3599949","url":null,"abstract":"<p><p>Monitoring space objects is crucial for space situational awareness, yet reconstructing 3D satellite models from ground-based telescope images is super challenging due to atmospheric turbulence, long observation distances, limited viewpoints, and low signal-to-noise ratios. In this paper, we propose a novel computational imaging framework that overcomes these obstacles by integrating a hybrid image pre-processing pipeline with a joint pose estimation and 3D reconstruction module based on controlled Gaussian Splatting (GS) and Branch-and-Bound (BnB) search. We validate our approach on both synthetic satellite datasets and on-sky observations of China's Tiangong Space Station and the International Space Station, achieving robust 3D reconstructions of low-Earth orbit satellites from ground-based data. Quantitative evaluations using SSIM, PSNR, LPIPS, and Chamfer Distance demonstrate that our method outperforms state-of-the-art NeRF-based approaches, and ablation studies confirm the critical role of each component. Our framework enables high-fidelity 3D satellite monitoring from Earth, offering a cost-effective alternative for space situational awareness. Project page: https://ai4scientificimaging.org/3DSatellites.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-19 | DOI: 10.1109/TPAMI.2025.3599830
Liangchen Liu;Nannan Wang;Chen Chen;Decheng Liu;Xi Yang;Xinbo Gao;Tongliang Liu
This paper aims to learn multiple comprehensive text prompts that describe visual concepts from coarse to fine, thereby endowing pre-trained VLMs with better transferability to various downstream tasks. We explore this idea on transformer-based VLMs, since this kind of architecture achieves more compelling performance than CNN-based ones. Unfortunately, unlike CNNs, the transformer-based visual encoder of pre-trained VLMs cannot naturally provide discriminative and representative local visual information. To solve this problem, we propose Frequency-based Comprehensive Prompt Learning (FCPrompt) to excavate representative local visual information from the redundant output features of the visual encoder. FCPrompt transforms these features into the frequency domain via the Discrete Cosine Transform (DCT). Taking advantage of the energy compaction and orthogonality of the DCT, we can obtain compact, informative, and disentangled local visual information by leveraging specific frequency components of the transformed features. To better fit transformer architectures, FCPrompt further adopts and optimizes different text prompts to align with the global and frequency-based local visual information, respectively, via a dual-branch framework. The learned text prompts can thus describe the entire visual concept comprehensively, from coarse to fine. Extensive experiments indicate that FCPrompt achieves state-of-the-art performance on various benchmarks.
{"title":"Frequency-Based Comprehensive Prompt Learning for Vision-Language Models","authors":"Liangchen Liu;Nannan Wang;Chen Chen;Decheng Liu;Xi Yang;Xinbo Gao;Tongliang Liu","doi":"10.1109/TPAMI.2025.3599830","DOIUrl":"10.1109/TPAMI.2025.3599830","url":null,"abstract":"This paper targets to learn multiple comprehensive text prompts that can describe the visual concepts from coarse to fine, thereby endowing pre-trained VLMs with better transfer ability to various downstream tasks. We focus on exploring this idea on transformer-based VLMs since this kind of architecture achieves more compelling performances than CNN-based ones. Unfortunately, unlike CNNs, the transformer-based visual encoder of pre-trained VLMs cannot naturally provide discriminative and representative local visual information. To solve this problem, we propose Frequency-based Comprehensive Prompt Learning (FCPrompt) to excavate representative local visual information from the redundant output features of the visual encoder. FCPrompt transforms these features into frequency domain via Discrete Cosine Transform (DCT). Taking the advantages of energy concentration and information orthogonality of DCT, we can obtain compact, informative and disentangled local visual information by leveraging specific frequency components of the transformed frequency features. To better fit with transformer architectures, FCPrompt further adopts and optimizes different text prompts to respectively align with the global and frequency-based local visual information via a dual-branch framework. Finally, the learned text prompts can thus describe the entire visual concepts from coarse to fine comprehensively. Extensive experiments indicate that FCPrompt achieves the state-of-the-art performances on various benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11974-11989"},"PeriodicalIF":18.6,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-19 | DOI: 10.1109/TPAMI.2025.3600256
Wei Wei;Jianguo Wu;Xinyao Guo;Jing Yan;Jiye Liang
Existing clustering ensemble methods typically fuse all base clusterings in one shot under unsupervised settings, making it difficult to distinguish the quality of individual base clusterings and to exploit latent prior knowledge; consequently, their adaptability to data distributions and overall performance are limited. To address these issues, this paper proposes the Self-Constrained Clustering Ensemble (SCCE) algorithm. SCCE treats the pseudo-labels automatically generated from the current clustering results as self-supervised signals and performs metric learning to obtain a linear transformation that enlarges inter-class distances while compressing intra-class distances. The base clusterings are then re-clustered in the new metric space to enhance separability and consistency. Afterward, ensemble updating is applied iteratively, forming a self-driven closed loop that continuously improves model performance. Theoretical analysis shows that the model converges efficiently via alternating optimization, with computational complexity of the same order as mainstream methods. Experiments on public datasets demonstrate that the proposed algorithm significantly outperforms representative clustering ensemble approaches, validating its effectiveness and robustness in scenarios lacking external supervision.
{"title":"Self-Constrained Clustering Ensemble","authors":"Wei Wei;Jianguo Wu;Xinyao Guo;Jing Yan;Jiye Liang","doi":"10.1109/TPAMI.2025.3600256","DOIUrl":"10.1109/TPAMI.2025.3600256","url":null,"abstract":"Existing clustering ensemble methods typically fuse all base clusterings in one shot under unsupervised settings, making it difficult to distinguish the quality of individual base clusterings and to exploit latent prior knowledge; consequently, their adaptability to data distributions and overall performance are limited. To address these issues, this paper proposes the Self-Constrained Clustering Ensemble (SCCE) algorithm. SCCE treats the pseudolabels automatically generated from current clustering results as selfsupervised signals and performs metric learning to obtain a linear transformation that enlarges interclass distances while compressing intraclass distances. The base clusterings are then reclustered in the new metric space to enhance separability and consistency. Afterward, ensemble updating is iteratively applied, forming a self-driven closed loop that continuously improves model performance. Theoretical analysis shows that the model converges efficiently via alternating optimization, with computational complexity on the same order as mainstream methods. Experiments on public datasets demonstrate that the proposed algorithm significantly outperforms representative clustering ensemble approaches, validating its effectiveness and robustness in scenarios lacking external supervision.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11946-11960"},"PeriodicalIF":18.6,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-19 | DOI: 10.1109/TPAMI.2025.3600494
Luxi Chen;Zhengyi Wang;Zihan Zhou;Tingting Gao;Hang Su;Jun Zhu;Chongxuan Li
Optimization-based approaches, such as score distillation sampling (SDS), show promise in zero-shot 3D generation but suffer from low efficiency, primarily due to the high number of function evaluations (NFEs) required for each sample and the limitation of optimization confined to latent space. This paper introduces score-based iterative reconstruction (SIR), an efficient and general algorithm that mimics a differentiable 3D reconstruction process to reduce the NFEs and enable optimization in pixel space. Given a single set of images sampled from a multi-view score-based diffusion model, SIR repeatedly optimizes the 3D parameters, unlike the single-step optimization in SDS. Together with other improvements in training, we present an efficient approach called MicroDreamer that applies to various 3D representations and 3D generation tasks. In particular, MicroDreamer is 5-20 times faster than SDS in generating neural radiance fields while retaining comparable performance, and it takes about 20 seconds to create meshes from 3D Gaussian splatting on a single A100 GPU, halving the time of the fastest optimization-based baseline, DreamGaussian, with performance gains that are significant relative to the measurement standard deviation. Our code is available at https://github.com/ML-GSAI/MicroDreamer.
{"title":"MicroDreamer: Efficient 3D Generation in $sim$20 Seconds by Score-based Iterative Reconstruction.","authors":"Luxi Chen, Zhengyi Wang, Zihan Zhou, Tingting Gao, Hang Su, Jun Zhu, Chongxuan Li","doi":"10.1109/TPAMI.2025.3600494","DOIUrl":"10.1109/TPAMI.2025.3600494","url":null,"abstract":"<p><p>Optimization-based approaches, such as score distillation sampling (SDS), show promise in zero-shot 3D generation but suffer from low efficiency, primarily due to the high number of function evaluations (NFEs) required for each sample and the limitation of optimization confined to latent space. This paper introduces score-based iterative reconstruction (SIR), an efficient and general algorithm mimicking a differentiable 3D reconstruction process to reduce the NFEs and enable optimization in pixel space. Given a single set of images sampled from a multi-view score-based diffusion model, SIR repeatedly optimizes 3D parameters, unlike the single-step optimization in SDS. With other improvements in training, we present an efficient approach called MicroDreamer that generally applies to various 3D representations and 3D generation tasks. In particular, MicroDreamer is 5-20 times faster than SDS in generating neural radiance field while retaining a comparable performance and takes about 20 seconds to create meshes from 3D Gaussian splatting on a single A100 GPU, halving the time of the fastest optimization-based baseline DreamGaussian with significantly superior performance compared to the measurement standard deviation. Our code is available at https://github.com/ML-GSAI/MicroDreamer.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-19 | DOI: 10.1109/TPAMI.2025.3600435
Shuailei Ma;Yuefeng Wang;Ying Wei;Enming Zhang;Jiaqi Fan;Xinyu Sun;Peihao Chen
Open World Object Detection (OWOD) is a novel and considerably challenging computer vision task that bridges the gap between classic object detection (OD) and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to localize all potential unseen/unknown objects and incrementally learn them. Large pre-trained vision-language grounding models (VLMs, e.g., GLIP) have rich knowledge about the open world but are limited by text prompts and cannot localize indescribable objects. However, there are many detection scenarios in which pre-defined language descriptions are unavailable during inference. In this paper, we attempt to specialize the VLM for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that a simple knowledge distillation approach leads to unexpectedly strong performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures, causing catastrophic damage to the model’s ability to learn about known objects. To alleviate these problems, we propose a down-weight training strategy for knowledge distillation from the vision-language model to a single-visual-modality one. Meanwhile, we propose cascade decoupled decoders that decouple the learning of localization and recognition to reduce the impact of category interactions between known and unknown objects on the localization learning process. Ablation experiments demonstrate that both are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. Additionally, to alleviate the current lack of comprehensive benchmarks for evaluating unknown object detection in the open world, we refine the benchmark by augmenting the annotations of unknown objects, which we name “IntensiveSet♠”. Comprehensive experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods.
{"title":"SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-World Object Detector","authors":"Shuailei Ma;Yuefeng Wang;Ying Wei;Enming Zhang;Jiaqi Fan;Xinyu Sun;Peihao Chen","doi":"10.1109/TPAMI.2025.3600435","DOIUrl":"10.1109/TPAMI.2025.3600435","url":null,"abstract":"Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to localize all potential unseen/unknown objects and incrementally learn them. The large pre-trained vision-language grounding models (VLM, e.g., GLIP) have rich knowledge about the open world, but are limited by text prompts and cannot localize indescribable objects. However, there are many detection scenarios in which pre-defined language descriptions are unavailable during inference. In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that the simple <b>knowledge distillation</b> approach leads to unexpected performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures, leading to catastrophic damage to the model’s ability to learn about known objects. To alleviate these problems, we propose the <b>down-weight training strategy</b> for knowledge distillation from vision-language model to single visual modality one. Meanwhile, we propose the <b>cascade decoupled decoders</b> that decouple the learning of localization and recognition to reduce the impact of category interactions of known and unknown objects on the localization learning process. Ablation experiments demonstrate that both of them are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. Additionally, to alleviate the current lack of comprehensive benchmarks for evaluating the ability of the open-world detector to detect unknown objects in the open world, we refine the benchmark for evaluating the performance of unknown object detection by augmenting annotations for unknown objects which we name“<b>IntensiveSet</b><inline-formula><tex-math>$scriptstylespadesuit$</tex-math></inline-formula>”. Comprehensive experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11738-11752"},"PeriodicalIF":18.6,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}