GrowSP++: Growing Superpoints and Primitives for Unsupervised 3D Semantic Segmentation
Pub Date: 2026-01-02 | DOI: 10.1109/TPAMI.2025.3650165
Zihui Zhang, Weisheng Dai, Bing Wang, Bo Li, Bo Yang
We study the problem of 3D semantic segmentation from raw point clouds. Unlike existing methods, which primarily rely on large amounts of human annotations to train neural networks, we propose GrowSP++, an unsupervised method that identifies complex semantic classes for every point in 3D scenes without needing any human labels. Our method is composed of three major components: 1) a feature extractor incorporating 2D-3D feature distillation, 2) a superpoint constructor featuring progressively growing superpoints, and 3) a semantic primitive constructor with an additional growing strategy. The key to our method is the superpoint constructor together with the progressive growing strategy on both superpoints and semantic primitives, driving the feature extractor to progressively learn similar features for 3D points belonging to the same semantic class. We extensively evaluate our method on five challenging indoor and outdoor datasets, demonstrating state-of-the-art performance over all unsupervised baselines. We hope our work can inspire more advanced methods for unsupervised 3D semantic learning.
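The growing strategy can be pictured as clustering with a shrinking cluster budget: superpoints are repeatedly merged by grouping their mean features, so regions of the same semantic class gradually coalesce. The sketch below illustrates that general idea only; the feature backbone, merge schedule, and function names are assumptions, not the GrowSP++ implementation.

```python
# Illustrative sketch of progressive growing by clustering (not the GrowSP++ code).
# Assumes per-point features from an unsupervised backbone; the merge schedule,
# feature dimensions, and function names are made up for this example.
import numpy as np
from sklearn.cluster import KMeans

def grow_superpoints(point_feats, sp_labels, target_num_sp):
    """Merge current superpoints into fewer, larger ones by clustering their mean features."""
    sp_ids = np.unique(sp_labels)
    sp_feats = np.stack([point_feats[sp_labels == s].mean(axis=0) for s in sp_ids])
    merge = KMeans(n_clusters=target_num_sp, n_init=10).fit_predict(sp_feats)
    new_labels = np.empty_like(sp_labels)
    for old_id, new_id in zip(sp_ids, merge):
        new_labels[sp_labels == old_id] = new_id   # relabel points with merged superpoint id
    return new_labels

# Shrinking superpoint budget: points of the same semantic class gradually end up together.
N, D = 10000, 32
feats = np.random.randn(N, D).astype(np.float32)   # stand-in for learned per-point features
sp = np.random.randint(0, 800, size=N)             # stand-in for initial superpoints
for num_sp in (600, 400, 200, 100):
    sp = grow_superpoints(feats, sp, num_sp)
```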
{"title":"GrowSP++: Growing Superpoints and Primitives for Unsupervised 3D Semantic Segmentation.","authors":"Zihui Zhang, Weisheng Dai, Bing Wang, Bo Li, Bo Yang","doi":"10.1109/TPAMI.2025.3650165","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3650165","url":null,"abstract":"<p><p>We study the problem of 3D semantic segmentation from raw point clouds. Unlike existing methods which primarily rely on a large amount of human annotations for training neural networks, we proposes GrowSP++, an unsupervised method to successfully identify complex semantic classes for every point in 3D scenes, without needing any type of human labels. Our method is composed of three major components: 1) a feature extractor incorporating 2D-3D feature distillation, 2) a superpoint constructor featuring progressively growing superpoints, and 3) a semantic primitive constructor with an additional growing strategy. The key to our method is the superpoint constructor together with the progressive growing strategy on both super points and semantic primitives, driving the feature extractor to progressively learn similar features for 3D points belonging to the same semantic class. We extensively evaluate our method on five challenging indoor and outdoor datasets, demonstrating state of-the-art performance over all unsupervised baselines. We hope our work could inspire more advanced methods for unsupervised 3D semantic learning.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145893555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Multi-View Discrete Clustering Via Spectral Embedding Fusion
Pub Date: 2025-12-31 | DOI: 10.1109/TPAMI.2025.3649521
Ben Yang, Xuetao Zhang, Zhiyuan Xue, Feiping Nie, Badong Chen
Multi-view spectral clustering (MVSC) has garnered growing interest across various real-world applications, owing to its flexibility in managing diverse data space structures. Nevertheless, the fusion of multiple $n \times n$ similarity matrices and the separate post-discretization process hinder the use of MVSC in large-scale tasks, where $n$ denotes the number of samples. Moreover, noise in the different similarity matrices, along with the two-stage mismatch caused by post-discretization, reduces clustering effectiveness. To overcome these challenges, we establish a novel fast multi-view discrete clustering (FMVDC) model via spectral embedding fusion, which integrates spectral embedding matrices ($n \times c$, $c \ll n$) to directly obtain discrete sample categories, where $c$ indicates the number of clusters, bypassing the need for both similarity matrix fusion and post-discretization. To further enhance clustering efficiency, we employ an anchor-based spectral embedding strategy to decrease the computational complexity of spectral analysis from cubic to linear. Since gradient descent methods cannot handle discrete models, we propose a fast optimization strategy based on the coordinate descent method to solve the FMVDC model efficiently. Extensive studies demonstrate that FMVDC significantly improves clustering performance compared to existing state-of-the-art methods, particularly in large-scale clustering tasks.
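The anchor-based strategy is what brings the spectral step from cubic to linear cost in $n$: affinities are computed only to $m \ll n$ anchors and the embedding comes from factorizing that $n \times m$ matrix. The sketch below shows a generic anchor-graph spectral embedding of this kind; the affinity construction and parameter choices are illustrative assumptions, not the exact FMVDC formulation (which also avoids the final k-means discretization shown here).

```python
# Generic anchor-graph spectral embedding (illustrative, not the exact FMVDC model).
# Only an n x m anchor-affinity matrix is built and factorized, so the cost is linear in n.
import numpy as np
from sklearn.cluster import KMeans

def anchor_spectral_embedding(X, num_anchors=200, k=5, c=10, seed=0):
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), num_anchors, replace=False)]
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)   # squared distances to anchors
    Z = np.exp(-d2 / d2.mean())                                 # Gaussian affinities
    far = np.argsort(d2, axis=1)[:, k:]                         # keep only the k nearest anchors
    np.put_along_axis(Z, far, 0.0, axis=1)
    Z /= Z.sum(axis=1, keepdims=True)
    # With row-normalized Z, the spectral embedding of the implied n x n graph
    # Z diag(Z^T 1)^{-1} Z^T is given by the top left singular vectors of Z diag(Z^T 1)^{-1/2}.
    lam_inv_sqrt = 1.0 / np.sqrt(Z.sum(axis=0) + 1e-12)
    U, _, _ = np.linalg.svd(Z * lam_inv_sqrt, full_matrices=False)
    return U[:, :c]                                             # n x c embedding

X = np.random.randn(2000, 16)
F = anchor_spectral_embedding(X, num_anchors=200, k=5, c=10)
labels = KMeans(n_clusters=10, n_init=10).fit_predict(F)   # the post-discretization step FMVDC avoids
```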
{"title":"Fast Multi-View Discrete Clustering Via Spectral Embedding Fusion.","authors":"Ben Yang, Xuetao Zhang, Zhiyuan Xue, Feiping Nie, Badong Chen","doi":"10.1109/TPAMI.2025.3649521","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649521","url":null,"abstract":"<p><p>Multi-view spectral clustering (MVSC) has garnered growing interest across various real-world applications, owing to its flexibility in managing diverse data space structures. Nevertheless, the fusion of multiple $ntimes n$ similarity matrices and the separate post- discretization process hinder the utilization of MVSC in large-scale tasks, where $n$ denotes the number of samples. Moreover, noise in different similarity matrices, along with the two-stage mismatch caused by the post- discretization, results in a reduction in clustering effectiveness. To overcome these challenges, we establish a novel fast multi-view discrete clustering (FMVDC) model via spectral embedding fusion, which integrates spectral embedding matrices ($ntimes c$, $cll n$) to directly obtain discrete sample categories, where $c$ indicates the number of clusters, bypassing the need for both similarity matrix fusion and post- discretization. To further enhance clustering efficiency, we employ an anchor-based spectral embedding strategy to decrease the computational complexity of spectral analysis from cubic to linear. Since gradient descent methods are incapable of discrete models, we propose a fast optimization strategy based on the coordinate descent method to solve the FMVDC model efficiently. Extensive studies demonstrate that FMVDC significantly improves clustering performance compared to existing state-of-the-art methods, particularly in large-scale clustering tasks.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145879835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey of Behavior Foundation Model: Next-Generation Whole-Body Control System of Humanoid Robots
Pub Date: 2025-12-30 | DOI: 10.1109/TPAMI.2025.3649177
Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Dapeng Li, Huijiang Wang, Jiayu Chen, Bo Li, Wei Zhang, Wenjun Zeng, Hua Chen, Xin Jin
Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pre-training to learn reusable primitive skills and broad behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and regularly updated collection of BFM papers and projects to facilitate further research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.
{"title":"A Survey of Behavior Foundation Model: Next-Generation Whole-Body Control System of Humanoid Robots.","authors":"Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Dapeng Li, Huijiang Wang, Jiayu Chen, Bo Li, Wei Zhang, Wenjun Zeng, Hua Chen, Xin Jin","doi":"10.1109/TPAMI.2025.3649177","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649177","url":null,"abstract":"<p><p>Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pre-training to learn reusable primitive skills and broad behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and regularly updated collection of BFM papers and projects to facilitate further research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145866739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adversarial-based framework neglects the discriminability of target-domain features, leading to suboptimal performance. To bridge this theoretical-practical gap, we defined "good representation learning" as guaranteeing both transferability and discriminability, and proved that an additional loss term targeting target-domain discriminability is necessary. Building on these insights, we proposed a novel adversarial-based UDA framework that explicitly integrates a domain alignment objective with a discriminability-enhancing constraint. Instantiated as Domain-Invariant Representation Learning with Global and Local Consistency (RLGLC), our method leverages Asymmetrically-Relaxed Wasserstein of Wasserstein Distance (AR-WWD) to address class imbalance and semantic dimension weighting, and employs a local consistency mechanism to preserve fine-grained target-domain discriminative information. Extensive experiments across multiple benchmark datasets demonstrate that RLGLC consistently surpasses state-of-the-art methods, confirming the value of our theoretical perspective and underscoring the necessity of enforcing both transferability and discriminability in adversarial-based UDA.
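As a concrete illustration of combining a domain-alignment objective with a discriminability-enhancing constraint, the sketch below uses a standard domain-adversarial loss plus target-domain entropy minimization as a stand-in for the latter; RLGLC's actual AR-WWD and local-consistency terms are not reproduced here.

```python
# Generic adversarial UDA objective with an explicit target-discriminability term.
# Entropy minimization is a stand-in for RLGLC's constraint; AR-WWD and the
# local-consistency mechanism are not reproduced here.
import torch
import torch.nn.functional as F

def uda_loss(src_logits, src_labels, tgt_logits,
             src_domain_logits, tgt_domain_logits, lam=0.1):
    # 1) Source-domain empirical risk.
    cls = F.cross_entropy(src_logits, src_labels)
    # 2) Domain alignment: the discriminator separates source from target; the feature
    #    extractor is trained against it (e.g., via gradient reversal, omitted here).
    dom = (F.binary_cross_entropy_with_logits(src_domain_logits,
                                              torch.ones_like(src_domain_logits))
           + F.binary_cross_entropy_with_logits(tgt_domain_logits,
                                                torch.zeros_like(tgt_domain_logits)))
    # 3) Target discriminability: low prediction entropy keeps target features
    #    away from the decision boundaries.
    p = F.softmax(tgt_logits, dim=1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1).mean()
    return cls + dom + lam * entropy
```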
{"title":"On the Transferability and Discriminability of Representation Learning in Unsupervised Domain Adaptation.","authors":"Wenwen Qiang, Ziyin Gu, Lingyu Si, Jiangmeng Li, Changwen Zheng, Fuchun Sun, Hui Xiong","doi":"10.1109/TPAMI.2025.3649294","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649294","url":null,"abstract":"<p><p>In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adversarial-based framework neglects the discriminability of target-domain features, leading to suboptimal performance. To bridge this theoretical-practical gap, we defined \"good representation learning\" as guaranteeing both transferability and discriminability, and proved that an additional loss term targeting target-domain discriminability is necessary. Building on these insights, we proposed a novel adversarial-based UDA framework that explicitly integrates a domain alignment objective with a discriminability-enhancing constraint. Instantiated as Domain-Invariant Representation Learning with Global and Local Consistency (RLGLC), our method leverages Asymmetrically-Relaxed Wasserstein of Wasserstein Distance (AR-WWD) to address class imbalance and semantic dimension weighting, and employs a local consistency mechanism to preserve fine-grained target-domain discriminative information. Extensive experiments across multiple benchmark datasets demonstrate that RLGLC consistently surpasses state-of-the-art methods, confirming the value of our theoretical perspective and underscoring the necessity of enforcing both transferability and discriminability in adversarial-based UDA.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145866690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation
Pub Date: 2025-12-29 | DOI: 10.1109/TPAMI.2025.3649001
Ziyang Gong, Zhixiang Wei, Di Wang, Xiaoxing Hu, Xianzheng Ma, Hongruixuan Chen, Yuru Jia, Yupeng Deng, Zhenming Ji, Xiangwei Zhu, Xue Yang, Naoto Yokoya, Jing Zhang, Bo Du, Junchi Yan, Liangpei Zhang
Due to the substantial domain gaps in Remote Sensing (RS) images that are characterized by variabilities such as location, wavelength, and sensor type, Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. However, research in this area remains underexplored: (1) Current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies target the RSDG issue, especially for semantic segmentation tasks. Existing related models are developed for specific unknown domains, struggling with issues of underfitting on other unseen scenarios; (3) Existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 32 semantic segmentation scenarios across various regions, spectral bands, platforms, and climates, providing comprehensive evaluations of the generalizability of future RSDG models. Extensive experiments on this collection demonstrate the superiority of CrossEarth over existing state-of-the-art methods.
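Data-level style injection is often implemented by swapping low-frequency Fourier amplitudes between images (as in Fourier Domain Adaptation); the sketch below shows that generic trick only, as an assumed analogue of the Earth-Style Injection pipeline rather than the paper's actual procedure.

```python
# Minimal Fourier-amplitude style swap, a common data-level style-injection trick.
# This is only an assumed analogue of Earth-Style Injection, not the paper's pipeline.
import numpy as np

def fourier_style_inject(content, style, beta=0.05):
    """Replace the low-frequency amplitude of `content` with that of `style`; images are (H, W, C) in [0, 1]."""
    out = np.empty_like(content)
    h, w, ch = content.shape
    bh, bw = max(1, int(h * beta)), max(1, int(w * beta))
    cy, cx = h // 2, w // 2
    for c in range(ch):
        fc = np.fft.fftshift(np.fft.fft2(content[..., c]))
        fs = np.fft.fftshift(np.fft.fft2(style[..., c]))
        amp, pha = np.abs(fc), np.angle(fc)
        amp[cy - bh:cy + bh, cx - bw:cx + bw] = np.abs(fs)[cy - bh:cy + bh, cx - bw:cx + bw]
        out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * pha))))
    return np.clip(out, 0.0, 1.0)

stylized = fourier_style_inject(np.random.rand(128, 128, 3), np.random.rand(128, 128, 3))
```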
{"title":"CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation.","authors":"Ziyang Gong, Zhixiang Wei, Di Wang, Xiaoxing Hu, Xianzheng Ma, Hongruixuan Chen, Yuru Jia, Yupeng Deng, Zhenming Ji, Xiangwei Zhu, Xue Yang, Naoto Yokoya, Jing Zhang, Bo Du, Junchi Yan, Liangpei Zhang","doi":"10.1109/TPAMI.2025.3649001","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649001","url":null,"abstract":"<p><p>Due to the substantial domain gaps in Remote Sensing (RS) images that are characterized by variabilities such as location, wavelength, and sensor type, Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. However, research in this area remains underexplored: (1) Current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies target the RSDG issue, especially for semantic segmentation tasks. Existing related models are developed for specific unknown domains, struggling with issues of underfitting on other unseen scenarios; (3) Existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 32 semantic segmentation scenarios across various regions, spectral bands, platforms, and climates, providing comprehensive evaluations of the generalizability of future RSDG models. Extensive experiments on this collection demonstrate the superiority of CrossEarth over existing state-of-the-art methods.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bi-C2R: Bidirectional Continual Compatible Representation for Re-Indexing Free Lifelong Person Re-Identification
Pub Date: 2025-12-29 | DOI: 10.1109/TPAMI.2025.3649078
Zhenyu Cui, Jiahuan Zhou, Yuxin Peng
Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance across all data. Its main challenge is to avoid catastrophic forgetting of old knowledge while training on new data. Existing L-ReID methods typically re-extract features for all historical gallery images after each update for inference, known as "re-indexing". However, historical gallery data often cannot be stored directly due to data privacy concerns, and re-indexing large-scale gallery images is costly. As a result, retrieval becomes incompatible between query features extracted by the updated model and gallery features extracted by models before the update, greatly impairing re-identification performance. To tackle this issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. RFL-ReID is therefore more challenging than L-ReID, requiring the model to continuously learn and balance new and old knowledge from diverse streaming data while keeping the features produced by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that continuously updates the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. Specifically, a bidirectional compatible transfer network is first designed to bridge the relationship between new and old knowledge and continuously map the old gallery features into the new feature space after each update. Secondly, a bidirectional compatible distillation module and a bidirectional anti-forgetting distillation module are designed to balance the compatibility between new and old knowledge in the dual feature spaces. Thirdly, a feature-level exponential moving average strategy is designed to adaptively fill the knowledge gaps between different data domains. Finally, we validate the proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, demonstrating that it achieves leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.
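The re-index-free idea can be illustrated with a small transfer module that maps stored old-model gallery features into the new model's space, so only the compact features (never the raw images) are touched after an update. The architecture and loss below are assumptions for illustration, not the Bi-C2R design.

```python
# Illustrative sketch: a small transfer module maps stored old-model gallery features
# into the new model's space so the gallery is never re-indexed from images.
# The architecture and loss are assumptions, not the Bi-C2R design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransferNet(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f_old):
        return self.net(f_old)

def fit_transfer(transfer, f_old, f_new, epochs=50, lr=1e-3):
    """Fit on current training images where both old and new features are available."""
    opt = torch.optim.Adam(transfer.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = 1.0 - F.cosine_similarity(transfer(f_old), f_new, dim=1).mean()
        loss.backward()
        opt.step()
    return transfer

# After an update: project the stored old gallery features once, then match them
# against query features from the updated model.
transfer = fit_transfer(TransferNet(512), torch.randn(1024, 512), torch.randn(1024, 512))
gallery_in_new_space = transfer(torch.randn(5000, 512))
```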
{"title":"Bi-C<sup>2</sup>R: Bidirectional Continual Compatible Representation for Re-Indexing Free Lifelong Person Re-Identification.","authors":"Zhenyu Cui, Jiahuan Zhou, Yuxin Peng","doi":"10.1109/TPAMI.2025.3649078","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649078","url":null,"abstract":"<p><p>Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as \"re-indexing\". However, historical gallery data typically suffers from direct saving due to the data privacy issue and the high re-indexing costs for large-scale gallery images. As a result, it inevitably leads to incompatible retrieval between query features extracted by the updated model and gallery features extracted by those before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C<sup>2</sup>R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. Specifically, a bidirectional compatible transfer network is first designed to bridge the relationship between new and old knowledge and continuously update the old gallery features to the new feature space after the updating. Secondly, a bidirectional compatible distillation module and a bidirectional anti-forgetting distillation model are designed to balance the compatibility between the new and old knowledge in dual feature spaces. Finally, a feature-level exponential moving average strategy is designed to adaptively fill the diverse knowledge gaps between different data domains. Finally, we verify our proposed Bi-C<sup>2</sup>R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous Review and Timely Correction: Enhancing the Resistance to Noisy Labels Via Self-Not-True and Class-Wise Distillation
Pub Date: 2025-12-29 | DOI: 10.1109/TPAMI.2025.3649111
Long Lan, Jingyi Wang, Xinghao Wu, Bo Han, Xinwang Liu
Deep neural networks possess remarkable learning capabilities and expressive power, but this makes them vulnerable to overfitting, especially when they encounter mislabeled data. A notable phenomenon called the memorization effect occurs when networks first learn the correctly labeled data and later memorize the mislabeled instances. While early stopping can mitigate overfitting, it doesn't entirely prevent networks from adapting to incorrect labels during the initial training phases, which can result in losing valuable insights from accurate data. Moreover, early stopping cannot rectify the mistakes caused by mislabeled inputs, underscoring the need for improved strategies. In this paper, we introduce an innovative mechanism for continuous review and timely correction of learned knowledge. Our approach allows the network to repeatedly revisit and reinforce correct information while promptly addressing any inaccuracies stemming from mislabeled data. We present a novel method called self-not-true-distillation (SNTD). This technique employs self-distillation, where the network from previous training iterations acts as a teacher, guiding the current network to review and solidify its understanding of accurate labels. Crucially, SNTD masks the true class label in the logits during this process, concentrating on the non-true classes to correct any erroneous knowledge that may have been acquired. We also recognize that different data classes follow distinct learning trajectories. A single teacher network might struggle to effectively guide the learning of all classes at once, which necessitates selecting different teacher networks for each specific class. Additionally, the influence of the teacher network's guidance varies throughout the training process. To address these challenges, we propose SNTD+, which integrates a class-wise distillation strategy along with a dynamic weight adjustment mechanism. Together, these enhancements significantly bolster SNTD's robustness in tackling complex scenarios characterized by label noise.
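A minimal version of the not-true distillation idea masks the ground-truth class out of both teacher and student logits and distills only over the remaining classes; the temperature and scaling below are illustrative choices, not necessarily those of SNTD.

```python
# Sketch of a "not-true" distillation loss: the ground-truth class is removed from both
# teacher and student logits and the KL divergence is taken over the remaining classes.
# Temperature and scaling are illustrative, not necessarily SNTD's choices.
import torch
import torch.nn.functional as F

def not_true_distillation(student_logits, teacher_logits, labels, T=2.0):
    num_classes = student_logits.size(1)
    keep = F.one_hot(labels, num_classes) == 0                    # non-true classes only
    s = student_logits[keep].view(-1, num_classes - 1)
    t = teacher_logits[keep].view(-1, num_classes - 1)
    p_teacher = F.softmax(t / T, dim=1)
    log_p_student = F.log_softmax(s / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

loss = not_true_distillation(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```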
{"title":"Continuous Review and Timely Correction: Enhancing the Resistance to Noisy Labels Via Self-Not-True and Class-Wise Distillation.","authors":"Long Lan, Jingyi Wang, Xinghao Wu, Bo Han, Xinwang Liu","doi":"10.1109/TPAMI.2025.3649111","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649111","url":null,"abstract":"<p><p>Deep neural networks possess remarkable learning capabilities and expressive power, but this makes them vulnerable to overfitting, especially when they encounter mislabeled data. A notable phenomenon called the memorization effect occurs when networks first learn the correctly labeled data and later memorize the mislabeled instances. While early stopping can mitigate overfitting, it doesn't entirely prevent networks from adapting to incorrect labels during the initial training phases, which can result in losing valuable insights from accurate data. Moreover, early stopping cannot rectify the mistakes caused by mislabeled inputs, underscoring the need for improved strategies. In this paper, we introduce an innovative mechanism for continuous review and timely correction of learned knowledge. Our approach allows the network to repeatedly revisit and reinforce correct information while promptly addressing any inaccuracies stemming from mislabeled data. We present a novel method called self-not-true-distillation (SNTD). This technique employs self-distillation, where the network from previous training iterations acts as a teacher, guiding the current network to review and solidify its understanding of accurate labels. Crucially, SNTD masks the true class label in the logits during this process, concentrating on the non-true classes to correct any erroneous knowledge that may have been acquired. We also recognize that different data classes follow distinct learning trajectories. A single teacher network might struggle to effectively guide the learning of all classes at once, which necessitates selecting different teacher networks for each specific class. Additionally, the influence of the teacher network's guidance varies throughout the training process. To address these challenges, we propose SNTD+, which integrates a class-wise distillation strategy along with a dynamic weight adjustment mechanism. Together, these enhancements significantly bolster SNTD's robustness in tackling complex scenarios characterized by label noise.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting binary spike streams. Optical flow estimation aims to estimate pixel-level correspondences between different moments, describing motion information over time, and is a key task for spike cameras. High-quality optical flow is important since motion information is a foundation for analyzing spikes. However, extracting stable light-intensity information from spikes is difficult due to the randomness of binary spikes. Besides, the continuity of spikes can offer contextual information for optical flow. In this paper, we propose a network, Spike2Flow++, to estimate optical flow for spike cameras. In Spike2Flow++, we propose a differential of spike firing time (DSFT) to represent information in binary spikes. Moreover, we propose a dual DSFT representation and a dual correlation construction to extract stable light-intensity information for reliable correlations. To use the continuity of spikes as motion contextual information, we propose a joint correlation decoding (JCD) that jointly estimates a series of flow fields. To adaptively fuse different motions in JCD, we propose a global motion bank aggregation that constructs an information bank for all motions and adaptively extracts contexts from the bank for each iteration during the recurrent decoding of each motion. To train and evaluate our network, we construct the Real Scene with Spikes and Flow++ (RSSF++) dataset based on real-world scenes. Experiments demonstrate that Spike2Flow++ achieves state-of-the-art performance on RSSF++, photo-realistic high-speed motion (PHM), and real-captured data.
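One way to picture the DSFT representation: at every pixel and time step, look at the interval between the surrounding spike firing times, which is inversely related to light intensity. The sketch below computes such inter-spike intervals from a binary spike stream; the exact DSFT definition and its dual-representation use in Spike2Flow++ may differ.

```python
# Sketch of a differential-of-spike-firing-time style quantity: for every pixel and time
# step, the interval between the surrounding spikes serves as a light-intensity proxy
# (shorter interval = brighter). The exact DSFT definition in Spike2Flow++ may differ.
import numpy as np

def inter_spike_intervals(spikes):
    """spikes: (T, H, W) binary array; returns (T, H, W) intervals between surrounding spikes."""
    T, H, W = spikes.shape
    out = np.full(spikes.shape, float(T), dtype=np.float32)   # default: no reliable estimate
    for y in range(H):
        for x in range(W):
            times = np.flatnonzero(spikes[:, y, x])
            if len(times) < 2:
                continue
            prev_idx = np.searchsorted(times, np.arange(T), side="right") - 1
            next_idx = prev_idx + 1
            valid = (prev_idx >= 0) & (next_idx < len(times))
            prev_t = times[np.clip(prev_idx, 0, len(times) - 1)]
            next_t = times[np.clip(next_idx, 0, len(times) - 1)]
            out[:, y, x] = np.where(valid, next_t - prev_t, T)
    return out

intervals = inter_spike_intervals((np.random.rand(64, 32, 32) > 0.9).astype(np.uint8))
```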
{"title":"Spike Camera Optical Flow Estimation Based on Continuous Spike Streams.","authors":"Rui Zhao, Ruiqin Xiong, Dongkai Wang, Shiyu Xuan, Jian Zhang, Xiaopeng Fan, Tiejun Huang","doi":"10.1109/TPAMI.2025.3649050","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649050","url":null,"abstract":"<p><p>Spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting binary spike streams. Optical flow estimation aims to estimate pixel-level correspondences between different moments, describing motion information along time, which is a key task of spike camera. High-quality optical flow is important since motion information is a foundation for analyzing spikes. However, extracting stable light-intensity information from spikes is difficult due to the randomness of binary spikes. Besides, the continuity of spikes can offer contextual information for optical flow. In this paper, we propose a network Spike2Flow++ to estimate optical flow for spike camera. In Spike2Flow++, we propose a differential of spike firing time (DSFT) to represent information in binary spikes. Moreover, we propose a dual DSFT representation and a dual correlation construction to extract stable light-intensity information for reliable correlations. To use the continuity of spikes as motion contextual information, we propose a joint correlation decoding (JCD) that jointly estimates a series of flow fields. To adaptively fuse different motions in JCD, we propose a global motion bank aggregation to construct an information bank for all motions and adaptively extract contexts from the bank for each iteration during recurrent decoding of each motion. To train and evaluate our network, we construct a real scene with spikes and flow++ (RSSF++) based on real-world scenes. Experiments demonstrate that our Spike2Flow++ achieves state-of-the-art performance on RSSF++, photo-realistic high-speed motion (PHM), and real-captured data.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper revisits the canonical concept of learning structured representations without label supervision by eigendecomposition. Yet, unlike prior spectral methods such as Laplacian Eigenmap, which operate in a nonparametric manner, we aim to parametrically model the principal eigenfunctions of an integral operator defined by a kernel and a data distribution using a neural network, for enhanced scalability and reasonable out-of-sample generalization. To achieve this goal, we first present a new series of objective functions that generalize the EigenGame [1] to function space for learning neural eigenfunctions. We then show that, when the similarity metric is derived from positive relations in a data augmentation setup, a representation learning objective function that resembles those of popular self-supervised learning methods emerges, with an additional symmetry-breaking property for producing structured representations where features are ordered by importance. We call such a structured, adaptive-length deep representation a Neural Eigenmap. We demonstrate the use of Neural Eigenmaps as adaptive-length codes in image retrieval systems. By truncating according to feature importance, our method requires up to $16\times$ shorter representation length than leading self-supervised learning methods to achieve similar retrieval performance. We further apply our method to graph data and report strong results on a node representation learning benchmark with more than one million nodes.
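Because the learned features are ordered by importance, adaptive-length retrieval reduces to keeping the first $k$ dimensions of every code. A minimal sketch (dimension counts are illustrative assumptions):

```python
# Adaptive-length retrieval with importance-ordered codes: keep only the first k
# dimensions of every code and rank by cosine similarity. Dimension counts are
# illustrative assumptions.
import numpy as np

def retrieve(query_codes, gallery_codes, k_dims, topk=5):
    q = query_codes[:, :k_dims]
    g = gallery_codes[:, :k_dims]
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-12)
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(q @ g.T), axis=1)[:, :topk]

gallery = np.random.randn(10000, 512).astype(np.float32)   # full-length structured codes
queries = np.random.randn(8, 512).astype(np.float32)
short_code_hits = retrieve(queries, gallery, k_dims=32)     # truncated codes (32 of 512 dims)
full_code_hits = retrieve(queries, gallery, k_dims=512)
```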
{"title":"Neural Eigenfunctions Are Structured Representation Learners.","authors":"Zhijie Deng, Jiaxin Shi, Hao Zhang, Peng Cui, Cewu Lu, Jun Zhu","doi":"10.1109/TPAMI.2025.3625728","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3625728","url":null,"abstract":"<p><p>This paper revisits the canonical concept of learning structured representations without label supervision by eigendecomposition. Yet, unlike prior spectral methods such as Laplacian Eigenmap which operate in a nonparametric manner, we aim to parametrically model the principal eigenfunctions of an integral operator defined by a kernel and a data distribution using a neural network for enhanced scalability and reasonable out-of-sample generalization. To achieve this goal, we first present a new series of objective functions that generalize the EigenGame [1] to function space for learning neural eigenfunctions. We then show that, when the similarity metric is derived from positive relations in a data augmentation setup, a representation learning objective function that resembles those of popular self-supervised learning methods emerges, with an additional symmetry-breaking property for producing structured representations where features are ordered by importance. We call such a structured, adaptive-length deep representation Neural Eigenmap. We demonstrate using Neural Eigenmap as adaptive-length codes in image retrieval systems. By truncation according to feature importance, our method requires up to $16times$ shorter representation length than leading self-supervised learning ones to achieve similar retrieval performance. We further apply our method to graph data and report strong results on a node representation learning benchmark with more than one million nodes.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Affine Correspondences Between Multi-Camera Systems for Relative Pose Estimation
Pub Date: 2025-10-27 | DOI: 10.1109/TPAMI.2025.3626134
Banglei Guan, Ji Zhao
We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to the multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede an efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution, when exploiting the geometric constraints between ACs and multi-camera systems using a special parameterization. We present a problem formulation based on two ACs that encompass two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem—where the scale transformation of multi-camera systems is unknown—we propose 7DOF solvers to compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems. These include 5DOF solvers with known relative rotation angle, and 4DOF solver with known vertical direction. Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than the state-of-the-art algorithms, while resulting in a better relative pose accuracy.
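For a single camera pair, the extra equations an AC contributes can be obtained by expanding the epipolar constraint to first order around the correspondence: an infinitesimal in-plane displacement $\delta$ in the first image maps to $A\delta$ in the second, yielding two linear constraints in addition to the usual epipolar equation, i.e., three constraints per AC. This is the standard single-camera relation; the generalized multi-camera constraints used in the paper follow the same principle but are not reproduced here.

```latex
% First-order expansion of the epipolar constraint around an affine correspondence
% (x_1, x_2, A) for a single camera pair: an in-plane displacement \delta in image 1
% maps to A\delta in image 2, and expanding (x_2 + A\delta)^T F (x_1 + \delta) = 0 to
% first order yields two linear equations in addition to the epipolar constraint itself.
\begin{align}
  x_2^\top F\, x_1 &= 0, \\
  \big(F^\top x_2\big)_{[1:2]} + A^\top \big(F\, x_1\big)_{[1:2]} &= 0,
\end{align}
% where (\cdot)_{[1:2]} keeps the first two components. Each AC thus contributes three
% constraints, which is why two ACs suffice for the minimal 6DOF problem.
```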
{"title":"Affine Correspondences Between Multi-Camera Systems for Relative Pose Estimation","authors":"Banglei Guan;Ji Zhao","doi":"10.1109/TPAMI.2025.3626134","DOIUrl":"10.1109/TPAMI.2025.3626134","url":null,"abstract":"We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to the multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede an efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution, when exploiting the geometric constraints between ACs and multi-camera systems using a special parameterization. We present a problem formulation based on two ACs that encompass two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, we exploit a unified and versatile framework for generating 6DOF solvers. Building upon this foundation, we use this framework to address two categories of practical scenarios. First, for the more challenging 7DOF relative pose estimation problem—where the scale transformation of multi-camera systems is unknown—we propose 7DOF solvers to compute the relative pose and scale using three ACs. Second, leveraging inertial measurement units (IMUs), we introduce several minimal solvers for constrained relative pose estimation problems. These include 5DOF solvers with known relative rotation angle, and 4DOF solver with known vertical direction. Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than the state-of-the-art algorithms, while resulting in a better relative pose accuracy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2012-2029"},"PeriodicalIF":18.6,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}