Pub Date : 2025-12-29 DOI: 10.1109/TPAMI.2025.3649001
Ziyang Gong, Zhixiang Wei, Di Wang, Xiaoxing Hu, Xianzheng Ma, Hongruixuan Chen, Yuru Jia, Yupeng Deng, Zhenming Ji, Xiangwei Zhu, Xue Yang, Naoto Yokoya, Jing Zhang, Bo Du, Junchi Yan, Liangpei Zhang
Due to the substantial domain gaps among Remote Sensing (RS) images, which vary in location, wavelength, and sensor type, Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. However, this area remains underexplored: (1) current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) few studies target RSDG, especially for semantic segmentation, and existing related models are developed for specific unseen domains and tend to underfit other unseen scenarios; (3) existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce CrossEarth, the first vision foundation model for RSDG semantic segmentation. CrossEarth achieves strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we curate an RSDG benchmark comprising 32 semantic segmentation scenarios across various regions, spectral bands, platforms, and climates, providing comprehensive evaluations of the generalizability of future RSDG models. Extensive experiments on this collection demonstrate the superiority of CrossEarth over existing state-of-the-art methods.
{"title":"CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation.","authors":"Ziyang Gong, Zhixiang Wei, Di Wang, Xiaoxing Hu, Xianzheng Ma, Hongruixuan Chen, Yuru Jia, Yupeng Deng, Zhenming Ji, Xiangwei Zhu, Xue Yang, Naoto Yokoya, Jing Zhang, Bo Du, Junchi Yan, Liangpei Zhang","doi":"10.1109/TPAMI.2025.3649001","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649001","url":null,"abstract":"<p><p>Due to the substantial domain gaps in Remote Sensing (RS) images that are characterized by variabilities such as location, wavelength, and sensor type, Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. However, research in this area remains underexplored: (1) Current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies target the RSDG issue, especially for semantic segmentation tasks. Existing related models are developed for specific unknown domains, struggling with issues of underfitting on other unseen scenarios; (3) Existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 32 semantic segmentation scenarios across various regions, spectral bands, platforms, and climates, providing comprehensive evaluations of the generalizability of future RSDG models. Extensive experiments on this collection demonstrate the superiority of CrossEarth over existing state-of-the-art methods.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-29 DOI: 10.1109/TPAMI.2025.3649078
Zhenyu Cui, Jiahuan Zhou, Yuxin Peng
Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance across all data. Its main challenge is to avoid catastrophic forgetting of old knowledge while training on new data. Existing L-ReID methods typically re-extract features for all historical gallery images for inference after each update, a process known as "re-indexing". However, historical gallery data often cannot be stored directly due to data privacy concerns, and re-indexing large-scale gallery images is costly. As a result, retrieval becomes incompatible between query features extracted by the updated model and gallery features extracted before the update, greatly impairing re-identification performance. To tackle this issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. RFL-ReID is therefore more challenging than L-ReID: the model must continually learn and balance new and old knowledge from diverse streaming data while keeping the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that continuously updates the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. Specifically, a bidirectional compatible transfer network is first designed to bridge the relationship between new and old knowledge and continuously map the old gallery features into the new feature space after each model update. Second, a bidirectional compatible distillation module and a bidirectional anti-forgetting distillation model are designed to balance the compatibility between new and old knowledge in dual feature spaces. Third, a feature-level exponential moving average strategy is designed to adaptively fill the diverse knowledge gaps between different data domains. Finally, we verify the proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that it achieves leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.
{"title":"Bi-C<sup>2</sup>R: Bidirectional Continual Compatible Representation for Re-Indexing Free Lifelong Person Re-Identification.","authors":"Zhenyu Cui, Jiahuan Zhou, Yuxin Peng","doi":"10.1109/TPAMI.2025.3649078","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649078","url":null,"abstract":"<p><p>Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as \"re-indexing\". However, historical gallery data typically suffers from direct saving due to the data privacy issue and the high re-indexing costs for large-scale gallery images. As a result, it inevitably leads to incompatible retrieval between query features extracted by the updated model and gallery features extracted by those before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C<sup>2</sup>R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. Specifically, a bidirectional compatible transfer network is first designed to bridge the relationship between new and old knowledge and continuously update the old gallery features to the new feature space after the updating. Secondly, a bidirectional compatible distillation module and a bidirectional anti-forgetting distillation model are designed to balance the compatibility between the new and old knowledge in dual feature spaces. Finally, a feature-level exponential moving average strategy is designed to adaptively fill the diverse knowledge gaps between different data domains. Finally, we verify our proposed Bi-C<sup>2</sup>R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-29 DOI: 10.1109/TPAMI.2025.3649111
Long Lan, Jingyi Wang, Xinghao Wu, Bo Han, Xinwang Liu
Deep neural networks possess remarkable learning capabilities and expressive power, but this makes them vulnerable to overfitting, especially when they encounter mislabeled data. A notable phenomenon called the memorization effect occurs when networks first learn the correctly labeled data and later memorize the mislabeled instances. While early stopping can mitigate overfitting, it does not entirely prevent networks from adapting to incorrect labels during the initial training phases, which can result in losing valuable insights from accurate data. Moreover, early stopping cannot rectify the mistakes caused by mislabeled inputs, underscoring the need for improved strategies. In this paper, we introduce a mechanism for continuous review and timely correction of learned knowledge. Our approach allows the network to repeatedly revisit and reinforce correct information while promptly addressing any inaccuracies stemming from mislabeled data. We present a novel method called Self-Not-True Distillation (SNTD). This technique employs self-distillation, where the network from previous training iterations acts as a teacher, guiding the current network to review and solidify its understanding of accurate labels. Crucially, SNTD masks the true class label in the logits during this process, concentrating on the non-true classes to correct any erroneous knowledge that may have been acquired. We also recognize that different data classes follow distinct learning trajectories. A single teacher network may struggle to effectively guide the learning of all classes at once, which necessitates selecting different teacher networks for each specific class. Additionally, the influence of the teacher network's guidance varies throughout the training process. To address these challenges, we propose SNTD+, which integrates a class-wise distillation strategy along with a dynamic weight adjustment mechanism. Together, these enhancements significantly bolster SNTD's robustness in tackling complex scenarios characterized by label noise.
{"title":"Continuous Review and Timely Correction: Enhancing the Resistance to Noisy Labels Via Self-Not-True and Class-Wise Distillation.","authors":"Long Lan, Jingyi Wang, Xinghao Wu, Bo Han, Xinwang Liu","doi":"10.1109/TPAMI.2025.3649111","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649111","url":null,"abstract":"<p><p>Deep neural networks possess remarkable learning capabilities and expressive power, but this makes them vulnerable to overfitting, especially when they encounter mislabeled data. A notable phenomenon called the memorization effect occurs when networks first learn the correctly labeled data and later memorize the mislabeled instances. While early stopping can mitigate overfitting, it doesn't entirely prevent networks from adapting to incorrect labels during the initial training phases, which can result in losing valuable insights from accurate data. Moreover, early stopping cannot rectify the mistakes caused by mislabeled inputs, underscoring the need for improved strategies. In this paper, we introduce an innovative mechanism for continuous review and timely correction of learned knowledge. Our approach allows the network to repeatedly revisit and reinforce correct information while promptly addressing any inaccuracies stemming from mislabeled data. We present a novel method called self-not-true-distillation (SNTD). This technique employs self-distillation, where the network from previous training iterations acts as a teacher, guiding the current network to review and solidify its understanding of accurate labels. Crucially, SNTD masks the true class label in the logits during this process, concentrating on the non-true classes to correct any erroneous knowledge that may have been acquired. We also recognize that different data classes follow distinct learning trajectories. A single teacher network might struggle to effectively guide the learning of all classes at once, which necessitates selecting different teacher networks for each specific class. Additionally, the influence of the teacher network's guidance varies throughout the training process. To address these challenges, we propose SNTD+, which integrates a class-wise distillation strategy along with a dynamic weight adjustment mechanism. Together, these enhancements significantly bolster SNTD's robustness in tackling complex scenarios characterized by label noise.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-29 DOI: 10.1109/TPAMI.2025.3649050
Rui Zhao, Ruiqin Xiong, Dongkai Wang, Shiyu Xuan, Jian Zhang, Xiaopeng Fan, Tiejun Huang
The spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting binary spike streams. Optical flow estimation aims to estimate pixel-level correspondences between different moments, describing motion information along time, and is a key task for spike cameras. High-quality optical flow is important since motion information is a foundation for analyzing spikes. However, extracting stable light-intensity information from spikes is difficult due to the randomness of binary spikes. In addition, the continuity of spikes can offer contextual information for optical flow. In this paper, we propose Spike2Flow++, a network that estimates optical flow for spike cameras. In Spike2Flow++, we propose a differential of spike firing time (DSFT) representation to encode the information in binary spikes. Moreover, we propose a dual DSFT representation and a dual correlation construction to extract stable light-intensity information for reliable correlations. To use the continuity of spikes as motion contextual information, we propose a joint correlation decoding (JCD) scheme that jointly estimates a series of flow fields. To adaptively fuse different motions in JCD, we propose a global motion bank aggregation that constructs an information bank for all motions and adaptively extracts contexts from the bank at each iteration during the recurrent decoding of each motion. To train and evaluate our network, we construct a Real Scenes with Spikes and Flow++ (RSSF++) dataset based on real-world scenes. Experiments demonstrate that Spike2Flow++ achieves state-of-the-art performance on RSSF++, photo-realistic high-speed motion (PHM), and real-captured data.
{"title":"Spike Camera Optical Flow Estimation Based on Continuous Spike Streams.","authors":"Rui Zhao, Ruiqin Xiong, Dongkai Wang, Shiyu Xuan, Jian Zhang, Xiaopeng Fan, Tiejun Huang","doi":"10.1109/TPAMI.2025.3649050","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3649050","url":null,"abstract":"<p><p>Spike camera is an emerging bio-inspired vision sensor with ultra-high temporal resolution. It records scenes by accumulating photons and outputting binary spike streams. Optical flow estimation aims to estimate pixel-level correspondences between different moments, describing motion information along time, which is a key task of spike camera. High-quality optical flow is important since motion information is a foundation for analyzing spikes. However, extracting stable light-intensity information from spikes is difficult due to the randomness of binary spikes. Besides, the continuity of spikes can offer contextual information for optical flow. In this paper, we propose a network Spike2Flow++ to estimate optical flow for spike camera. In Spike2Flow++, we propose a differential of spike firing time (DSFT) to represent information in binary spikes. Moreover, we propose a dual DSFT representation and a dual correlation construction to extract stable light-intensity information for reliable correlations. To use the continuity of spikes as motion contextual information, we propose a joint correlation decoding (JCD) that jointly estimates a series of flow fields. To adaptively fuse different motions in JCD, we propose a global motion bank aggregation to construct an information bank for all motions and adaptively extract contexts from the bank for each iteration during recurrent decoding of each motion. To train and evaluate our network, we construct a real scene with spikes and flow++ (RSSF++) based on real-world scenes. Experiments demonstrate that our Spike2Flow++ achieves state-of-the-art performance on RSSF++, photo-realistic high-speed motion (PHM), and real-captured data.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145859059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-10 DOI: 10.1109/TPAMI.2025.3642123
Jiacheng Zhang;Jiaming Li;Xiangru Lin;Wei Zhang;Xiao Tan;Hongbo Gao;Jingdong Wang;Guanbin Li
Semi-supervised object detection (SSOD) mitigates the annotation burden in object detection by leveraging unlabeled data, providing a scalable solution for modern perception systems. Concurrently, detection transformers (DETRs) have emerged as a popular end-to-end framework, offering advantages such as non-maximum suppression (NMS)-free inference. However, existing SSOD methods are predominantly designed for conventional detectors, leaving the exploration of DETR-based SSOD largely uncharted. This paper presents a systematic study to bridge this gap. We begin by identifying two principal obstacles in semi-supervised DETR training: (1) the inherent one-to-one assignment mechanism of DETRs is highly sensitive to noisy pseudo-labels, which impedes training efficiency; and (2) the query-based decoder architecture complicates the design of an effective consistency regularization scheme, limiting further performance gains. To address these challenges, we propose Semi-DETR++, a novel framework for efficient SSOD with DETRs. Our approach introduces a stage-wise hybrid matching strategy that enhances robustness to noisy pseudo-labels by synergistically combining one-to-many and one-to-one assignments while preserving NMS-free inference. Furthermore, based on our observation of the unique layer-wise decoding behavior in DETRs, we develop a simple yet effective re-decode query consistency training method to regularize the decoder. Extensive experiments demonstrate that Semi-DETR++ enables more efficient semi-supervised learning across various DETR architectures, outperforming existing methods by significant margins. The proposed components are also flexible and versatile, showing superior generalization by readily extending to semi-supervised segmentation tasks.
{"title":"Toward Efficient Semi-Supervised Object Detection With Detection Transformer","authors":"Jiacheng Zhang;Jiaming Li;Xiangru Lin;Wei Zhang;Xiao Tan;Hongbo Gao;Jingdong Wang;Guanbin Li","doi":"10.1109/TPAMI.2025.3642123","DOIUrl":"10.1109/TPAMI.2025.3642123","url":null,"abstract":"Semi-supervised object detection (SSOD) mitigates the annotation burden in object detection by leveraging unlabeled data, providing a scalable solution for modern perception systems. Concurrently, detection transformers (DETRs) have emerged as a popular end-to-end framework, offering advantages such as non-maximum suppression (NMS)-free inference. However, existing SSOD methods are predominantly designed for conventional detectors, leaving the exploration of DETR-based SSOD largely uncharted. This paper presents a systematic study to bridge this gap. We begin by identifying two principal obstacles in semi-supervised DETR training: (1) the inherent one-to-one assignment mechanism of DETRs is highly sensitive to noisy pseudo-labels, which impedes training efficiency; and (2) the query-based decoder architecture complicates the design of an effective consistency regularization scheme, limiting further performance gains. To address these challenges, we propose Semi-DETR++, a novel framework for efficient SSOD with DETRs. Our approach introduces a stage-wise hybrid matching strategy that enhances robustness to noisy pseudo-labels by synergistically combining one-to-many and one-to-one assignments while preserving NMS-free inference. Furthermore, based on our observation of the unique layer-wise decoding behavior in DETRs, we develop a simple yet effective re-decode query consistency training method to regularize the decoder. Extensive experiments demonstrate that Semi-DETR++ enables more efficient semi-supervised learning across various DETR architectures, outperforming existing methods by significant margins. The proposed components are also flexible and versatile, showing superior generalization by readily extending to semi-supervised segmentation tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3765-3782"},"PeriodicalIF":18.6,"publicationDate":"2025-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145717354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-05 DOI: 10.1109/TPAMI.2025.3640709
Zhiyuan Zhai;Xiaojun Yuan;Xin Wang;Geoffrey Ye Li
Decentralized federated learning (DFL) is an emerging paradigm that enables edge devices to collaboratively train a learning model via device-to-device (D2D) communication without the coordination of a parameter server (PS). Aggregation weights, also known as mixing weights, are crucial in the DFL process and impact learning efficiency and accuracy. Conventional designs rely on a central entity to collect all local information and conduct system optimization to obtain appropriate weights. In this paper, we develop a distributed aggregation weight optimization algorithm that aligns with the decentralized nature of DFL. We analyze convergence by quantitatively capturing the impact of the aggregation weights over decentralized communication networks. Based on this analysis, we formulate a learning performance optimization problem that designs the aggregation weights to minimize the derived convergence bound. The optimization problem is further transformed into an eigenvalue optimization problem and solved by our proposed subgradient-based algorithm in a distributed fashion. In our algorithm, edge devices only need local information to obtain the optimal aggregation weights through local (D2D) communications, just like the learning itself. Therefore, the optimization, communication, and learning processes can all be conducted in a distributed fashion, leading to a genuinely distributed DFL system. Numerical results demonstrate the superiority of the proposed algorithm in practical DFL deployments.
{"title":"Decentralized Federated Learning With Distributed Aggregation Weight Optimization","authors":"Zhiyuan Zhai;Xiaojun Yuan;Xin Wang;Geoffrey Ye Li","doi":"10.1109/TPAMI.2025.3640709","DOIUrl":"10.1109/TPAMI.2025.3640709","url":null,"abstract":"Decentralized federated learning (DFL) is an emerging paradigm to enable edge devices collaboratively training a learning model using a device-to-device (D2D) communication manner without the coordination of a parameter server (PS). Aggregation weights, also known as mixing weights, are crucial in DFL process, and impact the learning efficiency and accuracy. Conventional design relies on a so-called central entity to collect all local information and conduct system optimization to obtain appropriate weights. In this paper, we develop a distributed aggregation weight optimization algorithm to align with the decentralized nature of DFL. We analyze convergence by quantitatively capturing the impact of the aggregation weights over decentralized communication networks. Based on the analysis, we then formulate a learning performance optimization problem by designing the aggregation weights to minimize the derived convergence bound. The optimization problem is further transformed as an eigenvalue optimization problem and solved by our proposed subgradient-based algorithm in a distributed fashion. In our algorithm, edge devices only need local information to obtain the optimal aggregation weights through local (D2D) communications, just like the learning itself. Therefore, the optimization, communication, and learning process can be all conducted in a distributed fashion, which leads to a genuinely distributed DFL system. Numerical results demonstrate the superiority of the proposed algorithm in practical DFL deployment.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3899-3910"},"PeriodicalIF":18.6,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145680419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-05 DOI: 10.1109/TPAMI.2025.3640233
Xinyu Liu;Zhen Chen;Wuyang Li;Chenxin Li;Yixuan Yuan
Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages extrinsic contextual information to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency, which utilizes intrinsic contextual information to enhance spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing FLOPs by 90.8% and parameters by 85.8%.
{"title":"Harnessing Lightweight Transformer With Contextual Synergic Enhancement for Efficient 3D Medical Image Segmentation","authors":"Xinyu Liu;Zhen Chen;Wuyang Li;Chenxin Li;Yixuan Yuan","doi":"10.1109/TPAMI.2025.3640233","DOIUrl":"10.1109/TPAMI.2025.3640233","url":null,"abstract":"Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages the <italic>extrinsic contextual information</i> to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency that utilizes <italic>intrinsic contextual information</i> to enhance the spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing the FLOPs by 90.8% and parameters by 85.8%.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3852-3867"},"PeriodicalIF":18.6,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145680421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-05 DOI: 10.1109/TPAMI.2025.3640697
Mengyuan Liu;Jinfu Liu;Yongkang Jiang;Bin He
Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise, and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust, and concise pooled features of the human body through a feedback pooling module. The extracted pooled features demonstrate clear performance advantages over pose data and heatmap features previously obtained from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks, namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome, and UAV-Human (captured by uncrewed aerial vehicles), consistently verify the effectiveness of our HP-Net, which outperforms existing human action recognition methods.
{"title":"Heatmap Pooling Network for Action Recognition From RGB Videos","authors":"Mengyuan Liu;Jinfu Liu;Yongkang Jiang;Bin He","doi":"10.1109/TPAMI.2025.3640697","DOIUrl":"10.1109/TPAMI.2025.3640697","url":null,"abstract":"Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate obvious performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome and uncrewed aerial vehicles (UAV)-Human consistently verify the effectiveness of our HP-Net, which outperforms the existing human action recognition methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3726-3743"},"PeriodicalIF":18.6,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145680418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-05 DOI: 10.1109/TPAMI.2025.3640589
Xiang Xu;Lingdong Kong;Hui Shuai;Wenwei Zhang;Liang Pan;Kai Chen;Ziwei Liu;Qingshan Liu
LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
{"title":"Enhanced Spatiotemporal Consistency for Image-to-LiDAR Data Pretraining","authors":"Xiang Xu;Lingdong Kong;Hui Shuai;Wenwei Zhang;Liang Pan;Kai Chen;Ziwei Liu;Qingshan Liu","doi":"10.1109/TPAMI.2025.3640589","DOIUrl":"10.1109/TPAMI.2025.3640589","url":null,"abstract":"LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose <bold>SuperFlow++</b>, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: <bold>(1)</b> a view consistency alignment module to unify semantic information across camera views, <bold>(2)</b> a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, <bold>(3)</b> a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and <bold>(4)</b> a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3819-3834"},"PeriodicalIF":18.6,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145680422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-05 DOI: 10.1109/TPAMI.2025.3640635
Mingjian Guang;Rui Zhang;Dawei Cheng;Xiaoyang Wang;Xin Liu;Jie Yang;Yi Ouyang;Xian Wu;Yefeng Zheng
Graph Neural Networks (GNNs) have achieved remarkable success in machine learning tasks by learning the features of graph data. However, experiments show that vanilla GNNs fail to achieve good classification performance in the field of graph anomaly detection. To address this issue, we propose and theoretically prove that high Class Homophily Variance (CHV) is the reason behind the suboptimal performance of GNN models in anomaly detection tasks. Statistical analysis shows that in most standard node classification datasets, homophily levels are similar across all classes, so CHV is low. In contrast, graph anomaly detection datasets have high CHV, as benign nodes are highly homophilic while anomalies are not, leading to a clear separation. To mitigate its impact, we propose a novel GNN model named Homophily Edge Augment Graph Neural Network (HEAug). Different from previous work, our method emphasizes generating new edges with low CHV, using the original edges as an auxiliary. HEAug samples homophily adjacency matrices from scratch using a self-attention mechanism and leverages nodes that are relevant in the feature space but not directly connected in the original graph. Additionally, we modify the loss function to penalize the generation of unnecessary heterophilic edges. Extensive comparison experiments demonstrate that HEAug achieves the best performance across eight benchmark datasets, covering anomaly detection, edgeless node classification, and adversarial attack settings. We also define a heterophily attack that increases the CHV of other graphs, demonstrating the effectiveness of our theory and model in various scenarios.
{"title":"Homophily Edge Augment Graph Neural Network for High-Class Homophily Variance Learning","authors":"Mingjian Guang;Rui Zhang;Dawei Cheng;Xiaoyang Wang;Xin Liu;Jie Yang;Yi Ouyang;Xian Wu;Yefeng Zheng","doi":"10.1109/TPAMI.2025.3640635","DOIUrl":"10.1109/TPAMI.2025.3640635","url":null,"abstract":"Graph Neural Networks (GNNs) have achieved remarkable success in machine learning tasks by learning the features of graph data. However, experiments show that vanilla GNNs fail to achieve good classification performance in the field of graph anomaly detection. To address this issue, we propose and theoretically prove that the high-Class Homophily Variance (CHV) characteristic is the reason behind the suboptimal performance of GNN models in anomaly detection tasks. Statistical analysis shows that in most standard node classification datasets, homophily levels are similar across all classes, so CHV is low. In contrast, graph anomaly detection datasets have high CHV, as benign nodes are highly homophilic while anomalies are not, leading to a clear separation. To mitigate its impact, we propose a novel GNN model named Homophily Edge Augment Graph Neural Network (<monospace>HEAug</monospace>). Different from previous work, our method emphasizes generating new edges with low CHV value, using the original edges as an auxiliary. <monospace>HEAug</monospace> samples homophily adjacency matrices from scratch using a self-attention mechanism, and leverages nodes that are relevant in the feature space but not directly connected in the original graph. Additionally, we modify the loss function to punish the generation of unnecessary heterophilic edges by the model. Extensive comparison experiments demonstrate that <monospace>HEAug</monospace> achieved the best performance across eight benchmark datasets, including anomaly detection, edgeless node classification and adversarial attack. We also defined a heterophily attack to increase the CHV value in other graphs, demonstrating the effectiveness of our theory and model in various scenarios.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3835-3851"},"PeriodicalIF":18.6,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145680420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}