Pub Date: 2025-12-05 | DOI: 10.1109/TPAMI.2025.3640697
Mengyuan Liu;Jinfu Liu;Yongkang Jiang;Bin He
Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise, and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust, and concise pooled features of the human body through a feedback pooling module. The extracted pooled features demonstrate clear performance advantages over previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks, namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome, and UAV-Human, consistently verify the effectiveness of our HP-Net, which outperforms existing human action recognition methods.
Title: Heatmap Pooling Network for Action Recognition From RGB Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 3, pp. 3726-3743
Pub Date: 2025-12-05 | DOI: 10.1109/TPAMI.2025.3640589
Xiang Xu;Lingdong Kong;Hui Shuai;Wenwei Zhang;Liang Pan;Kai Chen;Ziwei Liu;Qingshan Liu
LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
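The temporal voting strategy in (4) is described only at a high level. Assuming it fuses per-point class predictions across registered consecutive scans by majority vote (the point association across scans is a hypothetical simplification here, not the paper's exact procedure), a minimal sketch:

```python
import numpy as np

def temporal_vote(pred_per_scan, num_classes):
    """Majority-vote per-point labels across T registered LiDAR scans.

    pred_per_scan: (T, N) int array of per-point class predictions,
    assuming the N points are already associated across the T scans.
    Returns an (N,) array of fused labels.
    """
    T, N = pred_per_scan.shape
    counts = np.zeros((N, num_classes), dtype=int)
    for t in range(T):
        # Tally each scan's vote for every point.
        counts[np.arange(N), pred_per_scan[t]] += 1
    return counts.argmax(axis=1)
```

A point predicted as class 1 in two of three scans, for instance, is fused to class 1, smoothing out per-scan noise.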
Title: Enhanced Spatiotemporal Consistency for Image-to-LiDAR Data Pretraining
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 3, pp. 3819-3834
Graph Neural Networks (GNNs) have achieved remarkable success in machine learning tasks by learning the features of graph data. However, experiments show that vanilla GNNs fail to achieve good classification performance in the field of graph anomaly detection. To address this issue, we propose and theoretically prove that the high-Class Homophily Variance (CHV) characteristic is the reason behind the suboptimal performance of GNN models in anomaly detection tasks. Statistical analysis shows that in most standard node classification datasets, homophily levels are similar across all classes, so CHV is low. In contrast, graph anomaly detection datasets have high CHV, as benign nodes are highly homophilic while anomalies are not, leading to a clear separation. To mitigate its impact, we propose a novel GNN model named Homophily Edge Augment Graph Neural Network (HEAug). Different from previous work, our method emphasizes generating new edges with low CHV value, using the original edges as an auxiliary. HEAug samples homophily adjacency matrices from scratch using a self-attention mechanism, and leverages nodes that are relevant in the feature space but not directly connected in the original graph. Additionally, we modify the loss function to penalize the generation of unnecessary heterophilic edges. Extensive comparison experiments demonstrate that HEAug achieves the best performance across eight benchmark datasets, spanning anomaly detection, edgeless node classification, and adversarial-attack settings. We also define a heterophily attack that increases the CHV value of other graphs, demonstrating the effectiveness of our theory and model in various scenarios.
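As a rough illustration of the CHV statistic the abstract centers on, assuming per-class homophily is the average fraction of same-label neighbors over nodes of that class and CHV is its variance across classes (the exact definition is in the paper; this is a sketch):

```python
import numpy as np

def class_homophily_variance(edges, labels):
    """Sketch of Class Homophily Variance (CHV).

    edges: iterable of undirected (u, v) pairs; labels: (N,) int array.
    Per-class homophily = mean, over nodes of that class, of the
    fraction of a node's neighbors sharing its label; CHV is the
    variance of these per-class values.
    """
    n = len(labels)
    same = np.zeros(n)
    deg = np.zeros(n)
    for u, v in edges:
        # Count each endpoint's neighbors and same-label neighbors.
        for a, b in ((u, v), (v, u)):
            deg[a] += 1
            if labels[a] == labels[b]:
                same[a] += 1
    node_h = np.divide(same, deg, out=np.zeros(n), where=deg > 0)
    per_class = [node_h[labels == c].mean() for c in np.unique(labels)]
    return np.var(per_class)
```

On a graph where one class is tightly homophilic and the other mostly links across classes, this value is large, which is the anomaly-detection regime the paper describes.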
Title: Homophily Edge Augment Graph Neural Network for High-Class Homophily Variance Learning
Authors: Mingjian Guang; Rui Zhang; Dawei Cheng; Xiaoyang Wang; Xin Liu; Jie Yang; Yi Ouyang; Xian Wu; Yefeng Zheng
Pub Date: 2025-12-05 | DOI: 10.1109/TPAMI.2025.3640635
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 3, pp. 3835-3851
Pub Date: 2025-12-04 | DOI: 10.1109/TPAMI.2025.3640109
Xinwang Liu;Ke Liang;Jun Wang;Suyuan Liu;Xiangke Wang;Huaimin Wang
Multi-view clustering (MVC), as an important machine learning task, aims to group data into distinct clusters by leveraging complementary and consistent information across multiple views. During the last two decades, it has been widely studied, and the many proposed methods have driven remarkable development in this field. However, few works comprehensively summarize existing methods and point out the potential challenges for the coming decades. To this end, our survey thoroughly reviews existing MVC methods according to three taxonomies, i.e., techniques, fusion strategies, and scenarios. Specifically, seven typical techniques, four fusion strategies, and five typical scenarios are included. Besides, we also collect the commonly used datasets and analyze the performance of typical MVC methods. Moreover, we summarize six application scenarios of existing MVC methods, ranging from computer vision and information retrieval to medical diagnosis and bioinformatics. In particular, we point out seven promising future directions in this field, which we hope will inform and inspire further research.
Title: Two Decades of Multi-View Clustering: Taxonomy, Application, and Challenge
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 3, pp. 3744-3764
Pub Date: 2025-12-04 | DOI: 10.1109/TPAMI.2025.3640429
Kaiyan Zhang;Xinghui Li;Jingyi Lu;Kai Han
Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in the literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding of existing methods for semantic matching, we thoroughly conduct controlled experiments to analyze the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development.
Title: Semantic Correspondence: Unified Benchmarking and a Strong Baseline
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 3, pp. 3911-3930
Pub Date: 2025-12-04 | DOI: 10.1109/TPAMI.2025.3640172
Viet Quan Le;Viet Cuong Ta
Learning over dynamic graphs poses major challenges, including capturing the evolving relationships in the graphs. Inspired by the advantages of hyperbolic embedding for static graphs, the hyperbolic space is expected to capture complex interactions in dynamic graphs. However, due to distortion errors in the standard tangent-space mappings, hyperbolic methods become more sensitive to noise and suffer reduced learning capacity. To address the distortion in tangent space, we previously proposed HMPTGN, a temporal graph network that operates directly on the hyperbolic manifold. In this journal paper, we introduce the HMPTGN+ architecture, an extension of the original HMPTGN with major updates to learn better representations of dynamic graphs based on hyperbolic embedding. Our framework incorporates a high-order graph neural network for extracting spatial dependencies, a dilated causal attention mechanism for modeling temporal patterns while preserving causality, and a curvature-awareness mechanism to capture dynamic structures. Extensive experiments demonstrate the effectiveness of our proposed HMPTGN+ framework over state-of-the-art baselines in both temporal link prediction and temporal new link prediction tasks.
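To make the tangent-space distortion concrete: on the Poincaré ball (curvature -1), the logarithm map at the origin preserves distances to the origin but contracts distances between other points, which is the kind of error that motivates operating directly on the manifold. A small numpy sketch (illustrative, not the paper's code):

```python
import numpy as np

def poincare_dist(u, v):
    """Geodesic distance on the Poincaré ball (curvature -1)."""
    uu, vv = np.dot(u, u), np.dot(v, v)
    duv = np.dot(u - v, u - v)
    return np.arccosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))

def log0(x):
    """Logarithm map at the origin: ball point -> tangent vector."""
    nrm = np.linalg.norm(x)
    return 2 * np.arctanh(nrm) * x / nrm if nrm > 0 else x

u, v = np.array([0.6, 0.0]), np.array([0.0, 0.6])
true_d = poincare_dist(u, v)                    # manifold distance
tangent_d = np.linalg.norm(log0(u) - log0(v))   # tangent-space estimate
```

Here `tangent_d` underestimates `true_d`, even though each point's distance to the origin is mapped exactly.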
Title: Toward an Advanced Temporal Graph Network in Hyperbolic Space
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 3, pp. 3868-3884
Pub Date: 2025-12-03 | DOI: 10.1109/TPAMI.2025.3639647
Chuyu Wang;Huiting Deng;Dong Liu
Electrical Impedance Tomography (EIT) provides a non-invasive, portable imaging modality with significant potential in medical and industrial applications. Despite its advantages, EIT encounters two primary challenges: the ill-posed nature of its inverse problem and the spatially variable, location-dependent sensitivity distribution. Traditional model-based methods mitigate ill-posedness through regularization but overlook sensitivity variability, while supervised deep learning approaches require extensive training data and lack generalization. Recent developments in neural fields have introduced implicit regularization techniques for image reconstruction; however, these methods often overlook the physical principles underlying EIT, thereby limiting their effectiveness. In this study, we propose PhyNC (Physics-driven Neural Compensation), an unsupervised deep learning framework that incorporates the physical principles of EIT. PhyNC addresses both the ill-posed inverse problem and the sensitivity distribution by dynamically allocating neural representational capacity to regions with lower sensitivity, ensuring accurate and balanced conductivity reconstructions. Extensive evaluations on both simulated and experimental data demonstrate that PhyNC outperforms existing methods in terms of detail preservation and artifact resistance, particularly in low-sensitivity regions. Our approach enhances the robustness of EIT reconstructions and provides a flexible framework that can be adapted to other imaging modalities with similar challenges.
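As a sketch of the model-based baseline the abstract contrasts against, here is a one-step linearized Tikhonov reconstruction; the Jacobian `J` below is a hypothetical toy (real EIT Jacobians come from a forward solver):

```python
import numpy as np

def tikhonov_reconstruct(J, dv, lam=1e-2):
    """One-step linearized EIT reconstruction with Tikhonov regularization.

    J: (m, n) sensitivity (Jacobian) matrix; dv: (m,) measured voltage
    perturbation. Solves min ||J ds - dv||^2 + lam ||ds||^2 in closed form.
    """
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), J.T @ dv)
```

With a toy Jacobian whose second column has low sensitivity, the regularized solution recovers the first pixel almost exactly but strongly under-recovers the second, which illustrates the location-dependent sensitivity problem PhyNC is designed to compensate for.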
Title: Physics-Driven Neural Compensation for Electrical Impedance Tomography
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 3, pp. 3783-3800
Pub Date: 2025-12-03 | DOI: 10.1109/TPAMI.2025.3639593
Linshan Wu;Jiaxin Zhuang;Hao Chen
The scarcity of annotations poses a significant challenge in medical image analysis, which demands extensive effort from radiologists, especially for high-dimensional 3D medical images. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development for medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which offers a promising way to learn consistent representations. Motivated by this, we introduce a simple yet effective Volume Contrast (VoCo) framework to leverage geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. Then we predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo implicitly encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. To assess effectiveness, we (1) introduce PreCT-160K, the largest medical image pre-training dataset to date, which comprises 160 K Computed Tomography (CT) volumes covering diverse anatomic structures; (2) investigate scaling laws and propose guidelines for tailoring different model sizes to various medical tasks; (3) build a comprehensive benchmark encompassing 51 medical tasks, including segmentation, classification, registration, and vision-language. Extensive experiments highlight the superiority of VoCo, showcasing promising transferability to unseen modalities and datasets.
VoCo notably enhances performance on datasets with limited labeled cases and significantly expedites fine-tuning convergence.
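The pretext task above, predicting a random crop's contextual position by contrasting it with base crops, can be sketched as follows. The encoder features are stand-ins (toy vectors in the example), and the overlap-ratio target is an assumption about how the position supervision is formed:

```python
import numpy as np

def voco_position_logits(rand_feat, base_feats):
    """Contrast a random crop's feature against K base-crop features.

    Cosine similarities serve as position logits: the crop is predicted
    to lie near the base regions it most resembles. Features are assumed
    to come from a shared encoder (hypothetical here).
    """
    r = rand_feat / np.linalg.norm(rand_feat)
    B = base_feats / np.linalg.norm(base_feats, axis=1, keepdims=True)
    return B @ r  # (K,) similarity per base region

def position_loss(logits, overlap):
    """Cross-entropy between softmax(logits) and overlap-ratio targets."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    t = overlap / overlap.sum()
    return -np.sum(t * np.log(p + 1e-12))
```

Minimizing this loss pushes the crop's feature toward the base crops it spatially overlaps, which is how the geometric context prior enters the representation.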
Title: Large-Scale 3D Medical Image Pre-Training With Geometric Context Priors
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 3, pp. 3801-3818
Pub Date: 2025-12-03 | DOI: 10.1109/TPAMI.2025.3639522
Mark Lindsey;Francis Kubala;Richard M. Stern
Online Active Learning (OAL) is a powerful tool for classifying evolving data streams using limited annotations from a human operator who is a domain expert. The objective of the OAL paradigm is to jointly minimize the classification error rate and the annotation cost across the data stream by posing periodic Active Learning (AL) queries. In this paper, this objective is extended to include identification of classifier errors by the expert during the typical workflow. To this end, Corrective Feedback (CF) is introduced as a second channel of interaction between the expert and the learning algorithm, complementary to the AL channel, that allows the algorithm to obtain additional training labels without disrupting the expert’s workflow. Online Active Learning with Corrective Feedback (OAL-CF) is formally defined as a paradigm, and its efficacy is demonstrated experimentally on two binary classification tasks, Spoken Language Verification and Voice-Type Discrimination. Finally, the effects of adding CF to the OAL paradigm are analyzed in terms of classification performance, annotation cost, trends over time, and class balance of the collected training data. Overall, the addition of CF results in a 53% relative reduction in cost compared to OAL without CF.
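The interplay of the two feedback channels can be sketched with a toy online linear classifier. The margin-based AL trigger, the `cf_rate` error-spotting model, and the perceptron-style update are all illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def oal_cf_stream(X, y, margin=0.5, cf_rate=0.3, seed=0):
    """Toy sketch of Online Active Learning with Corrective Feedback.

    An online linear classifier queries the expert (AL channel) when its
    decision margin is small; otherwise the expert notices and corrects
    a fraction `cf_rate` of visible errors (CF channel). Both channels
    yield training labels, but only AL queries count as annotation cost.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    errors = queries = 0
    for x, label in zip(X, y):  # labels in {-1, +1}
        score = w @ x
        pred = 1 if score >= 0 else -1
        if pred != label:
            errors += 1
        if abs(score) < margin:              # AL: uncertain -> query label
            queries += 1
            w += label * x
        elif pred != label and rng.random() < cf_rate:
            w += label * x                   # CF: expert corrects an error
    return w, errors, queries
```

The CF branch supplies extra labels at zero query cost, which is the mechanism behind the cost reduction the paper reports.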
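The dual-channel interaction described in the abstract can be illustrated with a toy stream loop. This is a minimal sketch under stated assumptions, not the paper's system: the logistic SGD learner, the uncertainty threshold for AL queries, and the 30% chance that the expert notices a confident error (the CF channel) are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(3)  # online logistic model: 2 features + bias

def sgd_update(w, x, y, lr=0.5):
    """One stochastic gradient step for logistic regression."""
    z = np.append(x, 1.0)
    p = 1.0 / (1.0 + np.exp(-w @ z))
    return w + lr * (y - p) * z

queries = 0    # AL channel: labels the learner explicitly asked for
cf_labels = 0  # CF channel: labels obtained when the expert spots an error
for t in range(2000):
    x = rng.uniform(-1, 1, 2)
    y_true = int(x[0] + x[1] > 0)  # hidden ground truth of the stream
    z = np.append(x, 1.0)
    p = 1.0 / (1.0 + np.exp(-w @ z))
    y_hat = int(p > 0.5)

    if abs(p - 0.5) < 0.1:
        # AL query: the learner is uncertain, so it interrupts the expert
        queries += 1
        w = sgd_update(w, x, y_true)
    elif y_hat != y_true and rng.random() < 0.3:
        # CF: the expert happens to notice a confident mistake during the
        # normal workflow and corrects it, at no extra query cost
        cf_labels += 1
        w = sgd_update(w, x, y_true)
```

Because CF labels arrive as a by-product of the expert's review rather than through explicit queries, the learner collects extra training labels while the AL query budget stays small relative to labeling the whole stream.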
{"title":"The Value of Corrective Feedback in the Online Active Learning Paradigm","authors":"Mark Lindsey;Francis Kubala;Richard M. Stern","doi":"10.1109/TPAMI.2025.3639522","DOIUrl":"10.1109/TPAMI.2025.3639522","url":null,"abstract":"Online Active Learning (OAL) is a powerful tool for classifying evolving data streams using limited annotations from a human operator who is a domain expert. The objective of the OAL learning paradigm is to minimize jointly the classification error rate and the annotation cost across the data stream by posing periodic Active Learning (AL) queries. In this paper, this objective is extended to include identification of classifier errors by the expert during the typical workflow. To this end, Corrective Feedback (CF) is introduced as a second channel of interaction between the expert and the learning algorithm, complementary to the AL channel, that allows the algorithm to obtain additional training labels without disrupting the expert’s workflow. Online Active Learning with Corrective Feedback (OAL-CF) is formally defined as a paradigm, and its efficacy is proven through experimental application to two binary classification tasks, Spoken Language Verification and Voice-Type Discrimination. Finally, the effects of adding CF to the OAL paradigm are analyzed in terms of classification performance, annotation cost, trends over time, and class balance of the collected training data. 
Overall, the addition of CF results in a 53% relative reduction in cost compared to OAL without CF.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3885-3898"},"PeriodicalIF":18.6,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11274545","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145663912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-03, DOI: 10.1109/TPAMI.2025.3639635
Xingcai Zhou;Guang Yang;Haotian Zheng;Linglong Kong;Jinde Cao
We study distributed principal component analysis (PCA) for large-scale federated data when the sample size $n$ and dimension $d$ are both ultra-large. This type of data is now very common, but poses numerous challenges for PCA learning, such as communication overhead and computational complexity. We develop a new algorithm ${\mathsf {FedFask}}$ (Fast Sketching for Federated learning) with lower communication cost $O(dr)$ and lower computational complexity $O(d(np/m+p^{2}+r^{2}))$, where $m$ is the number of workers, $r$ is the rank of the matrix, $p$ is the dimension of the sketched column space, and $r\leq p\ll d$. In ${\mathsf {FedFask}}$, we adopt and develop techniques such as fast sketching, alignment with orthogonal Procrustes fixing, and averaging on the matrix Stiefel manifold via a Kolmogorov-Nagumo-type average. Thus, ${\mathsf {FedFask}}$ achieves higher accuracy, lower stochastic variation, and the best representation of multiple randomly projected eigenspaces, while avoiding the orthogonal ambiguity of eigenspaces. We show that ${\mathsf {FedFask}}$ achieves the same rate of learning $O\left(\frac{\kappa _{r}r}{\lambda _{r}}\sqrt{\frac{r^{*}}{n}}\right)$ as centralized PCA using all the data, and tolerates more workers for parallel acceleration of the computation. We conduct extensive experiments to demonstrate the effectiveness of ${\mathsf {FedFask}}$.
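The communication pattern behind sketch-based distributed PCA can be sketched with a generic one-pass randomized method. This is an illustrative sketch of the general sketching technique, not the authors' ${\mathsf {FedFask}}$ algorithm: the shared test matrix, the two aggregation rounds, and the Rayleigh-Ritz step are assumptions; each worker communicates only a $d \times p$ sketch rather than its raw $n/m \times d$ data block.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, p, r = 1200, 50, 4, 12, 5  # samples, dim, workers, sketch dim, target rank

# Synthetic low-rank federated data: each worker holds a row block of X
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
blocks = np.array_split(X, m)

# Shared random test matrix, broadcast once to all workers
Omega = rng.standard_normal((d, p))

# Round 1: each worker sends only a d x p sketch of its local Gram matrix,
# so per-worker communication is O(dp) instead of O(nd/m)
Y = sum((B.T @ B) @ Omega for B in blocks)  # server aggregate: (X^T X) Omega
Q, _ = np.linalg.qr(Y)                      # orthonormal basis of the sketched space

# Round 2 (Rayleigh-Ritz): workers project their Gram matrices into the
# p-dimensional subspace; the server solves a small p x p eigenproblem
T = sum(Q.T @ (B.T @ B) @ Q for B in blocks)
evals, V = np.linalg.eigh(T)                # ascending eigenvalues
U = Q @ V[:, ::-1][:, :r]                   # top-r principal directions in R^d
```

Since the synthetic data here is exactly rank $r$ and $p > r$, the sketched subspace captures the full principal subspace, so `U` matches the top-$r$ right singular vectors of `X` up to rotation.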
{"title":"FedFask: Fast Sketching Distributed PCA for Large-Scale Federated Data","authors":"Xingcai Zhou;Guang Yang;Haotian Zheng;Linglong Kong;Jinde Cao","doi":"10.1109/TPAMI.2025.3639635","DOIUrl":"10.1109/TPAMI.2025.3639635","url":null,"abstract":"We study distributed principal component analysis (PCA) for large-scale federated data when the sample size <inline-formula><tex-math>$n$</tex-math></inline-formula> and dimension <inline-formula><tex-math>$d$</tex-math></inline-formula> are both ultra-large. This type of data is now very common, but poses numerous challenges for PCA learning, such as communication overhead and computational complexity. We develop a new algorithm <inline-formula><tex-math>${\mathsf {FedFask}}$</tex-math></inline-formula> (<b>Fa</b>st <b>Sk</b>etching for <b>Fed</b>erated learning) with lower communication cost <inline-formula><tex-math>$O(dr)$</tex-math></inline-formula> and lower computational complexity <inline-formula><tex-math>$O(d(np/m+p^{2}+r^{2}))$</tex-math></inline-formula>, where <inline-formula><tex-math>$m$</tex-math></inline-formula> is the number of workers, <inline-formula><tex-math>$r$</tex-math></inline-formula> is the rank of the matrix, <inline-formula><tex-math>$p$</tex-math></inline-formula> is the dimension of the sketched column space, and <inline-formula><tex-math>$r\leq p\ll d$</tex-math></inline-formula>. In <inline-formula><tex-math>${\mathsf {FedFask}}$</tex-math></inline-formula>, we adopt and develop techniques such as fast sketching, alignment with orthogonal Procrustes fixing, and averaging on the matrix Stiefel manifold via a Kolmogorov-Nagumo-type average. Thus, <inline-formula><tex-math>${\mathsf {FedFask}}$</tex-math></inline-formula> achieves higher accuracy, lower stochastic variation, and the best representation of multiple randomly projected eigenspaces, while avoiding the orthogonal ambiguity of eigenspaces. 
We show that <inline-formula><tex-math>${\mathsf {FedFask}}$</tex-math></inline-formula> achieves the same rate of learning <inline-formula><tex-math>$O\left(\frac{\kappa _{r}r}{\lambda _{r}}\sqrt{\frac{r^{*}}{n}}\right)$</tex-math></inline-formula> as centralized PCA using all the data, and tolerates more workers for parallel acceleration of the computation. We conduct extensive experiments to demonstrate the effectiveness of <inline-formula><tex-math>${\mathsf {FedFask}}$</tex-math></inline-formula>.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 3","pages":"3714-3725"},"PeriodicalIF":18.6,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145664268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}