Pub Date : 2026-01-21DOI: 10.1109/TPAMI.2026.3656742
Rongxuan Peng, Shunquan Tan, Xianbo Mo, Alex C Kot, Jiwu Huang
Recent advances in deep learning have significantly propelled the development of image forgery localization. However, existing models remain highly vulnerable to adversarial attacks: imperceptible noise added to forged images can severely mislead these models. In this paper, we address this challenge with an Adversarial Noise Suppression Module (ANSM) that generates a defensive perturbation to suppress the attack effect of adversarial noise. We observe that forgery-relevant features extracted from adversarial and original forged images exhibit distinct distributions. To bridge this gap, we introduce Forgery-relevant Features Alignment (FFA) as a first-stage training strategy, which reduces distributional discrepancies by minimizing the channel-wise Kullback-Leibler divergence between these features. To further refine the defensive perturbation, we design a second-stage training strategy, termed Mask-guided Refinement (MgR), which incorporates a dual-mask constraint. MgR ensures that the defensive perturbation remains effective for both adversarial and original forged images, recovering forgery localization accuracy to their original level. Extensive experiments across various attack algorithms demonstrate that our method significantly restores the forgery localization model's performance on adversarial images. Notably, when ANSM is applied to original forged images, the performance remains nearly unaffected. To our best knowledge, this is the first report of adversarial defense in image forgery localization tasks. We have released the source code and anti-forensics dataset.
{"title":"Active Adversarial Noise Suppression for Image Forgery Localization.","authors":"Rongxuan Peng, Shunquan Tan, Xianbo Mo, Alex C Kot, Jiwu Huang","doi":"10.1109/TPAMI.2026.3656742","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3656742","url":null,"abstract":"<p><p>Recent advances in deep learning have significantly propelled the development of image forgery localization. However, existing models remain highly vulnerable to adversarial attacks: imperceptible noise added to forged images can severely mislead these models. In this paper, we address this challenge with an Adversarial Noise Suppression Module (ANSM) that generates a defensive perturbation to suppress the attack effect of adversarial noise. We observe that forgery-relevant features extracted from adversarial and original forged images exhibit distinct distributions. To bridge this gap, we introduce Forgery-relevant Features Alignment (FFA) as a first-stage training strategy, which reduces distributional discrepancies by minimizing the channel-wise Kullback-Leibler divergence between these features. To further refine the defensive perturbation, we design a second-stage training strategy, termed Mask-guided Refinement (MgR), which incorporates a dual-mask constraint. MgR ensures that the defensive perturbation remains effective for both adversarial and original forged images, recovering forgery localization accuracy to their original level. Extensive experiments across various attack algorithms demonstrate that our method significantly restores the forgery localization model's performance on adversarial images. Notably, when ANSM is applied to original forged images, the performance remains nearly unaffected. To our best knowledge, this is the first report of adversarial defense in image forgery localization tasks. We have released the source code and anti-forensics dataset.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-16DOI: 10.1109/TPAMI.2026.3654665
Fangjinhua Wang, Qingtian Zhu, Di Chang, Quankai Gao, Junlin Han, Tong Zhang, Richard Hartley, Marc Pollefeys
3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods. We categorize these learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus significantly on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performances on popular benchmarks, and discuss promising future research directions in this area.
{"title":"Learning-Based Multi-View Stereo: A Survey.","authors":"Fangjinhua Wang, Qingtian Zhu, Di Chang, Quankai Gao, Junlin Han, Tong Zhang, Richard Hartley, Marc Pollefeys","doi":"10.1109/TPAMI.2026.3654665","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3654665","url":null,"abstract":"<p><p>3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods. We categorize these learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus significantly on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performances on popular benchmarks, and discuss promising future research directions in this area.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145992265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-02DOI: 10.1109/TPAMI.2025.3650165
Zihui Zhang, Weisheng Dai, Bing Wang, Bo Li, Bo Yang
We study the problem of 3D semantic segmentation from raw point clouds. Unlike existing methods which primarily rely on a large amount of human annotations for training neural networks, we proposes GrowSP++, an unsupervised method to successfully identify complex semantic classes for every point in 3D scenes, without needing any type of human labels. Our method is composed of three major components: 1) a feature extractor incorporating 2D-3D feature distillation, 2) a superpoint constructor featuring progressively growing superpoints, and 3) a semantic primitive constructor with an additional growing strategy. The key to our method is the superpoint constructor together with the progressive growing strategy on both super points and semantic primitives, driving the feature extractor to progressively learn similar features for 3D points belonging to the same semantic class. We extensively evaluate our method on five challenging indoor and outdoor datasets, demonstrating state of-the-art performance over all unsupervised baselines. We hope our work could inspire more advanced methods for unsupervised 3D semantic learning.
{"title":"GrowSP++: Growing Superpoints and Primitives for Unsupervised 3D Semantic Segmentation.","authors":"Zihui Zhang, Weisheng Dai, Bing Wang, Bo Li, Bo Yang","doi":"10.1109/TPAMI.2025.3650165","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3650165","url":null,"abstract":"<p><p>We study the problem of 3D semantic segmentation from raw point clouds. Unlike existing methods which primarily rely on a large amount of human annotations for training neural networks, we proposes GrowSP++, an unsupervised method to successfully identify complex semantic classes for every point in 3D scenes, without needing any type of human labels. Our method is composed of three major components: 1) a feature extractor incorporating 2D-3D feature distillation, 2) a superpoint constructor featuring progressively growing superpoints, and 3) a semantic primitive constructor with an additional growing strategy. The key to our method is the superpoint constructor together with the progressive growing strategy on both super points and semantic primitives, driving the feature extractor to progressively learn similar features for 3D points belonging to the same semantic class. We extensively evaluate our method on five challenging indoor and outdoor datasets, demonstrating state of-the-art performance over all unsupervised baselines. We hope our work could inspire more advanced methods for unsupervised 3D semantic learning.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145893555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-31DOI: 10.1109/TPAMI.2025.3649521
Ben Yang, Xuetao Zhang, Zhiyuan Xue, Feiping Nie, Badong Chen
Multi-view spectral clustering (MVSC) has garnered growing interest across various real-world applications, owing to its flexibility in managing diverse data space structures. Nevertheless, the fusion of multiple $ntimes n$ similarity matrices and the separate post- discretization process hinder the utilization of MVSC in large-scale tasks, where $n$ denotes the number of samples. Moreover, noise in different similarity matrices, along with the two-stage mismatch caused by the post- discretization, results in a reduction in clustering effectiveness. To overcome these challenges, we establish a novel fast multi-view discrete clustering (FMVDC) model via spectral embedding fusion, which integrates spectral embedding matrices ($ntimes c$, $cll n$) to directly obtain discrete sample categories, where $c$ indicates the number of clusters, bypassing the need for both similarity matrix fusion and post- discretization. To further enhance clustering efficiency, we employ an anchor-based spectral embedding strategy to decrease the computational complexity of spectral analysis from cubic to linear. Since gradient descent methods are incapable of discrete models, we propose a fast optimization strategy based on the coordinate descent method to solve the FMVDC model efficiently. Extensive studies demonstrate that FMVDC significantly improves clustering performance compared to existing state-of-the-art methods, particularly in large-scale clustering tasks.
{"title":"Fast Multi-View Discrete Clustering Via Spectral Embedding Fusion.","authors":"Ben Yang, Xuetao Zhang, Zhiyuan Xue, Feiping Nie, Badong Chen","doi":"10.1109/TPAMI.2025.3649521","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649521","url":null,"abstract":"<p><p>Multi-view spectral clustering (MVSC) has garnered growing interest across various real-world applications, owing to its flexibility in managing diverse data space structures. Nevertheless, the fusion of multiple $ntimes n$ similarity matrices and the separate post- discretization process hinder the utilization of MVSC in large-scale tasks, where $n$ denotes the number of samples. Moreover, noise in different similarity matrices, along with the two-stage mismatch caused by the post- discretization, results in a reduction in clustering effectiveness. To overcome these challenges, we establish a novel fast multi-view discrete clustering (FMVDC) model via spectral embedding fusion, which integrates spectral embedding matrices ($ntimes c$, $cll n$) to directly obtain discrete sample categories, where $c$ indicates the number of clusters, bypassing the need for both similarity matrix fusion and post- discretization. To further enhance clustering efficiency, we employ an anchor-based spectral embedding strategy to decrease the computational complexity of spectral analysis from cubic to linear. Since gradient descent methods are incapable of discrete models, we propose a fast optimization strategy based on the coordinate descent method to solve the FMVDC model efficiently. Extensive studies demonstrate that FMVDC significantly improves clustering performance compared to existing state-of-the-art methods, particularly in large-scale clustering tasks.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145879835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-30DOI: 10.1109/TPAMI.2025.3649177
Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Dapeng Li, Huijiang Wang, Jiayu Chen, Bo Li, Wei Zhang, Wenjun Zeng, Hua Chen, Xin Jin
Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pre-training to learn reusable primitive skills and broad behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and regularly updated collection of BFM papers and projects to facilitate further research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.
{"title":"A Survey of Behavior Foundation Model: Next-Generation Whole-Body Control System of Humanoid Robots.","authors":"Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Dapeng Li, Huijiang Wang, Jiayu Chen, Bo Li, Wei Zhang, Wenjun Zeng, Hua Chen, Xin Jin","doi":"10.1109/TPAMI.2025.3649177","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649177","url":null,"abstract":"<p><p>Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pre-training to learn reusable primitive skills and broad behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and regularly updated collection of BFM papers and projects to facilitate further research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145866739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adversarial-based framework neglects the discriminability of target-domain features, leading to suboptimal performance. To bridge this theoretical-practical gap, we defined "good representation learning" as guaranteeing both transferability and discriminability, and proved that an additional loss term targeting target-domain discriminability is necessary. Building on these insights, we proposed a novel adversarial-based UDA framework that explicitly integrates a domain alignment objective with a discriminability-enhancing constraint. Instantiated as Domain-Invariant Representation Learning with Global and Local Consistency (RLGLC), our method leverages Asymmetrically-Relaxed Wasserstein of Wasserstein Distance (AR-WWD) to address class imbalance and semantic dimension weighting, and employs a local consistency mechanism to preserve fine-grained target-domain discriminative information. Extensive experiments across multiple benchmark datasets demonstrate that RLGLC consistently surpasses state-of-the-art methods, confirming the value of our theoretical perspective and underscoring the necessity of enforcing both transferability and discriminability in adversarial-based UDA.
{"title":"On the Transferability and Discriminability of Representation Learning in Unsupervised Domain Adaptation.","authors":"Wenwen Qiang, Ziyin Gu, Lingyu Si, Jiangmeng Li, Changwen Zheng, Fuchun Sun, Hui Xiong","doi":"10.1109/TPAMI.2025.3649294","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649294","url":null,"abstract":"<p><p>In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adversarial-based framework neglects the discriminability of target-domain features, leading to suboptimal performance. To bridge this theoretical-practical gap, we defined \"good representation learning\" as guaranteeing both transferability and discriminability, and proved that an additional loss term targeting target-domain discriminability is necessary. Building on these insights, we proposed a novel adversarial-based UDA framework that explicitly integrates a domain alignment objective with a discriminability-enhancing constraint. Instantiated as Domain-Invariant Representation Learning with Global and Local Consistency (RLGLC), our method leverages Asymmetrically-Relaxed Wasserstein of Wasserstein Distance (AR-WWD) to address class imbalance and semantic dimension weighting, and employs a local consistency mechanism to preserve fine-grained target-domain discriminative information. Extensive experiments across multiple benchmark datasets demonstrate that RLGLC consistently surpasses state-of-the-art methods, confirming the value of our theoretical perspective and underscoring the necessity of enforcing both transferability and discriminability in adversarial-based UDA.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145866690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-29DOI: 10.1109/TPAMI.2025.3649001
Ziyang Gong, Zhixiang Wei, Di Wang, Xiaoxing Hu, Xianzheng Ma, Hongruixuan Chen, Yuru Jia, Yupeng Deng, Zhenming Ji, Xiangwei Zhu, Xue Yang, Naoto Yokoya, Jing Zhang, Bo Du, Junchi Yan, Liangpei Zhang
Due to the substantial domain gaps in Remote Sensing (RS) images that are characterized by variabilities such as location, wavelength, and sensor type, Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. However, research in this area remains underexplored: (1) Current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies target the RSDG issue, especially for semantic segmentation tasks. Existing related models are developed for specific unknown domains, struggling with issues of underfitting on other unseen scenarios; (3) Existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 32 semantic segmentation scenarios across various regions, spectral bands, platforms, and climates, providing comprehensive evaluations of the generalizability of future RSDG models. Extensive experiments on this collection demonstrate the superiority of CrossEarth over existing state-of-the-art methods.
{"title":"CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation.","authors":"Ziyang Gong, Zhixiang Wei, Di Wang, Xiaoxing Hu, Xianzheng Ma, Hongruixuan Chen, Yuru Jia, Yupeng Deng, Zhenming Ji, Xiangwei Zhu, Xue Yang, Naoto Yokoya, Jing Zhang, Bo Du, Junchi Yan, Liangpei Zhang","doi":"10.1109/TPAMI.2025.3649001","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649001","url":null,"abstract":"<p><p>Due to the substantial domain gaps in Remote Sensing (RS) images that are characterized by variabilities such as location, wavelength, and sensor type, Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. However, research in this area remains underexplored: (1) Current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies target the RSDG issue, especially for semantic segmentation tasks. Existing related models are developed for specific unknown domains, struggling with issues of underfitting on other unseen scenarios; (3) Existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 32 semantic segmentation scenarios across various regions, spectral bands, platforms, and climates, providing comprehensive evaluations of the generalizability of future RSDG models. Extensive experiments on this collection demonstrate the superiority of CrossEarth over existing state-of-the-art methods.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-29DOI: 10.1109/TPAMI.2025.3649078
Zhenyu Cui, Jiahuan Zhou, Yuxin Peng
Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as "re-indexing". However, historical gallery data typically suffers from direct saving due to the data privacy issue and the high re-indexing costs for large-scale gallery images. As a result, it inevitably leads to incompatible retrieval between query features extracted by the updated model and gallery features extracted by those before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. Specifically, a bidirectional compatible transfer network is first designed to bridge the relationship between new and old knowledge and continuously update the old gallery features to the new feature space after the updating. Secondly, a bidirectional compatible distillation module and a bidirectional anti-forgetting distillation model are designed to balance the compatibility between the new and old knowledge in dual feature spaces. Finally, a feature-level exponential moving average strategy is designed to adaptively fill the diverse knowledge gaps between different data domains. Finally, we verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.
{"title":"Bi-C<sup>2</sup>R: Bidirectional Continual Compatible Representation for Re-Indexing Free Lifelong Person Re-Identification.","authors":"Zhenyu Cui, Jiahuan Zhou, Yuxin Peng","doi":"10.1109/TPAMI.2025.3649078","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649078","url":null,"abstract":"<p><p>Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as \"re-indexing\". However, historical gallery data typically suffers from direct saving due to the data privacy issue and the high re-indexing costs for large-scale gallery images. As a result, it inevitably leads to incompatible retrieval between query features extracted by the updated model and gallery features extracted by those before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C<sup>2</sup>R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. Specifically, a bidirectional compatible transfer network is first designed to bridge the relationship between new and old knowledge and continuously update the old gallery features to the new feature space after the updating. Secondly, a bidirectional compatible distillation module and a bidirectional anti-forgetting distillation model are designed to balance the compatibility between the new and old knowledge in dual feature spaces. Finally, a feature-level exponential moving average strategy is designed to adaptively fill the diverse knowledge gaps between different data domains. Finally, we verify our proposed Bi-C<sup>2</sup>R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-29DOI: 10.1109/TPAMI.2025.3649111
Long Lan, Jingyi Wang, Xinghao Wu, Bo Han, Xinwang Liu
Deep neural networks possess remarkable learning capabilities and expressive power, but this makes them vulnerable to overfitting, especially when they encounter mislabeled data. A notable phenomenon called the memorization effect occurs when networks first learn the correctly labeled data and later memorize the mislabeled instances. While early stopping can mitigate overfitting, it doesn't entirely prevent networks from adapting to incorrect labels during the initial training phases, which can result in losing valuable insights from accurate data. Moreover, early stopping cannot rectify the mistakes caused by mislabeled inputs, underscoring the need for improved strategies. In this paper, we introduce an innovative mechanism for continuous review and timely correction of learned knowledge. Our approach allows the network to repeatedly revisit and reinforce correct information while promptly addressing any inaccuracies stemming from mislabeled data. We present a novel method called self-not-true-distillation (SNTD). This technique employs self-distillation, where the network from previous training iterations acts as a teacher, guiding the current network to review and solidify its understanding of accurate labels. Crucially, SNTD masks the true class label in the logits during this process, concentrating on the non-true classes to correct any erroneous knowledge that may have been acquired. We also recognize that different data classes follow distinct learning trajectories. A single teacher network might struggle to effectively guide the learning of all classes at once, which necessitates selecting different teacher networks for each specific class. Additionally, the influence of the teacher network's guidance varies throughout the training process. To address these challenges, we propose SNTD+, which integrates a class-wise distillation strategy along with a dynamic weight adjustment mechanism. Together, these enhancements significantly bolster SNTD's robustness in tackling complex scenarios characterized by label noise.
{"title":"Continuous Review and Timely Correction: Enhancing the Resistance to Noisy Labels Via Self-Not-True and Class-Wise Distillation.","authors":"Long Lan, Jingyi Wang, Xinghao Wu, Bo Han, Xinwang Liu","doi":"10.1109/TPAMI.2025.3649111","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649111","url":null,"abstract":"<p><p>Deep neural networks possess remarkable learning capabilities and expressive power, but this makes them vulnerable to overfitting, especially when they encounter mislabeled data. A notable phenomenon called the memorization effect occurs when networks first learn the correctly labeled data and later memorize the mislabeled instances. While early stopping can mitigate overfitting, it doesn't entirely prevent networks from adapting to incorrect labels during the initial training phases, which can result in losing valuable insights from accurate data. Moreover, early stopping cannot rectify the mistakes caused by mislabeled inputs, underscoring the need for improved strategies. In this paper, we introduce an innovative mechanism for continuous review and timely correction of learned knowledge. Our approach allows the network to repeatedly revisit and reinforce correct information while promptly addressing any inaccuracies stemming from mislabeled data. We present a novel method called self-not-true-distillation (SNTD). This technique employs self-distillation, where the network from previous training iterations acts as a teacher, guiding the current network to review and solidify its understanding of accurate labels. Crucially, SNTD masks the true class label in the logits during this process, concentrating on the non-true classes to correct any erroneous knowledge that may have been acquired. We also recognize that different data classes follow distinct learning trajectories. A single teacher network might struggle to effectively guide the learning of all classes at once, which necessitates selecting different teacher networks for each specific class. Additionally, the influence of the teacher network's guidance varies throughout the training process. To address these challenges, we propose SNTD+, which integrates a class-wise distillation strategy along with a dynamic weight adjustment mechanism. Together, these enhancements significantly bolster SNTD's robustness in tackling complex scenarios characterized by label noise.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting binary spike streams. Optical flow estimation aims to estimate pixel-level correspondences between different moments, describing motion information along time, which is a key task of spike camera. High-quality optical flow is important since motion information is a foundation for analyzing spikes. However, extracting stable light-intensity information from spikes is difficult due to the randomness of binary spikes. Besides, the continuity of spikes can offer contextual information for optical flow. In this paper, we propose a network Spike2Flow++ to estimate optical flow for spike camera. In Spike2Flow++, we propose a differential of spike firing time (DSFT) to represent information in binary spikes. Moreover, we propose a dual DSFT representation and a dual correlation construction to extract stable light-intensity information for reliable correlations. To use the continuity of spikes as motion contextual information, we propose a joint correlation decoding (JCD) that jointly estimates a series of flow fields. To adaptively fuse different motions in JCD, we propose a global motion bank aggregation to construct an information bank for all motions and adaptively extract contexts from the bank for each iteration during recurrent decoding of each motion. To train and evaluate our network, we construct a real scene with spikes and flow++ (RSSF++) based on real-world scenes. Experiments demonstrate that our Spike2Flow++ achieves state-of-the-art performance on RSSF++, photo-realistic high-speed motion (PHM), and real-captured data.
{"title":"Spike Camera Optical Flow Estimation Based on Continuous Spike Streams.","authors":"Rui Zhao, Ruiqin Xiong, Dongkai Wang, Shiyu Xuan, Jian Zhang, Xiaopeng Fan, Tiejun Huang","doi":"10.1109/TPAMI.2025.3649050","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649050","url":null,"abstract":"<p><p>Spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting binary spike streams. Optical flow estimation aims to estimate pixel-level correspondences between different moments, describing motion information along time, which is a key task of spike camera. High-quality optical flow is important since motion information is a foundation for analyzing spikes. However, extracting stable light-intensity information from spikes is difficult due to the randomness of binary spikes. Besides, the continuity of spikes can offer contextual information for optical flow. In this paper, we propose a network Spike2Flow++ to estimate optical flow for spike camera. In Spike2Flow++, we propose a differential of spike firing time (DSFT) to represent information in binary spikes. Moreover, we propose a dual DSFT representation and a dual correlation construction to extract stable light-intensity information for reliable correlations. To use the continuity of spikes as motion contextual information, we propose a joint correlation decoding (JCD) that jointly estimates a series of flow fields. To adaptively fuse different motions in JCD, we propose a global motion bank aggregation to construct an information bank for all motions and adaptively extract contexts from the bank for each iteration during recurrent decoding of each motion. To train and evaluate our network, we construct a real scene with spikes and flow++ (RSSF++) based on real-world scenes. Experiments demonstrate that our Spike2Flow++ achieves state-of-the-art performance on RSSF++, photo-realistic high-speed motion (PHM), and real-captured data.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}