Causal Interventional Prompt Tuning for Few-Shot Out-of-Distribution Generalization
Pub Date: 2025-10-14 | DOI: 10.1109/TPAMI.2025.3621250
Jie Wen;Yicheng Liu;Chao Huang;Chengliang Liu;Yong Xu;Xiaochun Cao
Fine-tuning pre-trained vision-language models (VLMs) has shown substantial benefits in a wide range of downstream tasks, often achieving impressive performance with minimal labeled data. Parameter-efficient fine-tuning techniques, in particular, have demonstrated their effectiveness in enhancing downstream task performance. However, these methods frequently struggle to generalize to out-of-distribution (OOD) data due to their reliance on non-causal representations, which can introduce biases and spurious correlations that negatively impact decision-making. Such spurious factors hinder the model’s generalization ability beyond the training distribution. To address these challenges, in this paper, we propose a novel causal intervention-based prompt tuning method to adapt VLMs for few-shot OOD generalization. Specifically, we leverage the front-door adjustment technique from causal inference to mitigate the effects of spurious correlations and enhance the model’s focus on causal relationships. Built upon VLMs, our approach begins by decoupling causal and non-causal representations in the vision-language alignment process. The causal representation, which captures only essential, semantically relevant information, serves as a mediator variable between the input image and the output label, mitigating biases from the latent confounder. To further enrich this causal representation, we propose a novel text-based diversity augmentation technique that uses textual features to provide additional semantic context. This augmentation technique can enhance the diversity of the causal representation, making it more robust and generalizable to various OOD scenarios. Experimental results across multiple OOD datasets demonstrate that our method significantly outperforms existing approaches, achieving state-of-the-art generalization performance.
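For reference, the front-door adjustment the abstract invokes treats the causal representation as a mediator $M$ between the input image $X$ and the label $Y$; in its standard form from causal inference (how the paper parameterizes each term is not specified in the abstract) it reads:

```latex
P\bigl(Y \mid \mathrm{do}(X{=}x)\bigr)
  = \sum_{m} P(M{=}m \mid X{=}x) \sum_{x'} P\bigl(Y \mid X{=}x',\, M{=}m\bigr)\, P(X{=}x')
```

Routing the prediction through the mediator is what removes the influence of the latent confounder acting on $X$ and $Y$.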
{"title":"Causal Interventional Prompt Tuning for Few-Shot Out-of-Distribution Generalization","authors":"Jie Wen;Yicheng Liu;Chao Huang;Chengliang Liu;Yong Xu;Xiaochun Cao","doi":"10.1109/TPAMI.2025.3621250","DOIUrl":"10.1109/TPAMI.2025.3621250","url":null,"abstract":"Fine-tuning pre-trained vision-language models (VLMs) has shown substantial benefits in a wide range of downstream tasks, often achieving impressive performance with minimal labeled data. Parameter-efficient fine-tuning techniques, in particular, have demonstrated their effectiveness in enhancing downstream task performance. However, these methods frequently struggle to generalize to out-of-distribution (OOD) data due to their reliance on non-causal representations, which can introduce biases and spurious correlations that negatively impact decision-making. Such spurious factors hinder the model’s generalization ability beyond the training distribution. To address these challenges, in this paper, we propose a novel causal intervention-based prompt tuning method to adapt VLMs to few-shot OOD generalization. Specifically, we leverage the front-door adjustment technique from causal inference to mitigate the effects of spurious correlations and enhance the model’s focus on causal relationships. Built upon VLMs, our approach begins by decoupling causal and non-causal representations in the vision-language alignment process. The causal representation that captures only essential semantically relevant information can serve as a mediator variable between the input image and output label, mitigating the biases from the latent confounder. To further enrich this causal representation, we propose a novel text-based diversity augmentation technique that uses textual features to provide additional semantic context. This augmentation technique can enhance the diversity of the causal representation, making it more robust and generalizable to various OOD scenarios. Experimental results across multiple OOD datasets demonstrate that our method significantly outperforms existing approaches, achieving state-of-the-art generalization performance.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1978-1991"},"PeriodicalIF":18.6,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145288451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting the Benefits of Temporal Information in the Realm of LiDAR Panoptic Segmentation
Pub Date: 2025-10-14 | DOI: 10.1109/TPAMI.2025.3621650
Ngoc-Quan Ha-Phan;Myungsik Yoo
LiDAR perception for autonomous driving provides highly accurate depictions of three-dimensional (3D) scenes. Among its tasks, LiDAR panoptic segmentation (LPS) is the most representative, as it delivers both instance- and semantic-level segmentation in a holistic manner. Although previous approaches have reached mature performance, no prior research has exploited temporal information to enhance LPS. Since multi-frame processing can improve predictions through richer feature representations and recursive forecasting, as demonstrated in other LiDAR perception tasks, this study proposes an effective, temporally aware panoptic segmentation method for LiDAR point clouds. Specifically, we introduce two modules: a convolution-based cross-frame fusion attention (CFFA) module and an adjacent shifted feature encoder (ASFE) module. The CFFA module fuses multi-frame features using convolution-based attention, whereas the ASFE module leverages the model outputs of adjacent frames as an intermediate guide for the final segmentation predictions. Extensive experiments confirm the effectiveness of both modules for LPS. The proposed model achieves impressive panoptic quality scores on popular benchmarks (63.36% on SemanticKITTI and 78.54% on Panoptic nuScenes), outperforming previous state-of-the-art methods by a significant margin. Further quantitative and qualitative analyses provide evidence of the advantages of multi-frame processing for LPS and illustrate its behavior under different settings.
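To make the cross-frame fusion idea concrete, here is a minimal, hypothetical sketch of convolution-based attention over features from adjacent frames. The module name, dense BEV-style tensor layout, and weighting scheme are illustrative assumptions, not the authors' CFFA implementation.

```python
import torch
import torch.nn as nn

class CrossFrameFusionAttention(nn.Module):
    """Illustrative sketch: fuse per-frame feature maps with convolution-based attention."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        # Convolution that scores each frame's contribution at every spatial location.
        self.score_conv = nn.Conv2d(num_frames * channels, num_frames, kernel_size=3, padding=1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, channels, H, W)
        b, t, c, h, w = frame_feats.shape
        stacked = frame_feats.reshape(b, t * c, h, w)
        # Per-frame, per-location attention weights (softmax over the frame axis).
        weights = torch.softmax(self.score_conv(stacked), dim=1)      # (b, t, h, w)
        fused = (weights.unsqueeze(2) * frame_feats).sum(dim=1)       # (b, c, h, w)
        return fused

# Usage: fuse a 3-frame window of 64-channel feature maps.
feats = torch.randn(2, 3, 64, 128, 128)
fused = CrossFrameFusionAttention(channels=64, num_frames=3)(feats)
```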
{"title":"Exploiting the Benefits of Temporal Information in the Realm of LiDAR Panoptic Segmentation","authors":"Ngoc-Quan Ha-Phan;Myungsik Yoo","doi":"10.1109/TPAMI.2025.3621650","DOIUrl":"10.1109/TPAMI.2025.3621650","url":null,"abstract":"LiDAR perception for autonomous driving applications offers highly accurate scene depiction in three-dimensional (3D) spaces, whose most representative task is LiDAR panoptic segmentation (LPS), as it offers exhibition of both instance- and semantic-level segmentation in a holistic manner. Although previous approaches have achieved mature performance, no research has explored temporal information for enhancing LPS performance. As multi-frame processing can assist in better predictions in terms of feature representation and recursive forecasting, which has been proven in other LiDAR perception challenges, this study proposes an effective and temporal-aware panoptic segmentation method for LiDAR point clouds. Specifically, we introduce two modules: convolution-based cross-frame fusion attention (CFFA) and adjacent shifted feature encoder (ASFE) modules. The CFFA module can fuse multi-frame features on the basis of the idea of convolution-based attention, whereas the ASFE module leverages adjacent model outputs and serves as an intermediate guide for final segmentation predictions. Consequent to our extensive experiments, the two modules have been reaffirmed in terms of their productivity in the realm of the LPS. The proposed LPS model achieves impressive panoptic-quality metric scores that are evaluated on different popular benchmarks (63.36% under SemanticKITTI and 78.54% under Panoptic nuScenes), outperforming previous state-of-the-art methods by a significant margin. Further quantitative and qualitative analyses provide evidence of the advantages of multi-frame processing for the LPS together with demonstrations of its particular behavior under different settings.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2048-2065"},"PeriodicalIF":18.6,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145288448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hier-EgoPack: Hierarchical Egocentric Video Understanding With Diverse Task Perspectives
Pub Date: 2025-10-14 | DOI: 10.1109/TPAMI.2025.3621326
Simone Alberto Peirone;Francesca Pistilli;Antonio Alliegro;Tatiana Tommasi;Giuseppe Averta
Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen next, all at once. To endow autonomous systems with such holistic perception, it is essential to learn how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, which is essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by also enabling reasoning across diverse temporal granularities, expanding its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.
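As a rough illustration of multi-granularity temporal reasoning (an assumption-laden toy, not the paper's GNN layer), one can pool frame-level nodes into coarser clip-level nodes and broadcast the coarse context back down:

```python
import torch

def hierarchical_message_pass(frame_feats: torch.Tensor, clip_index: torch.Tensor) -> torch.Tensor:
    """Toy two-level reasoning: frames -> clips (mean pooling) -> frames (broadcast back).

    frame_feats: (num_frames, dim) node features at the finest granularity.
    clip_index:  (num_frames,) long tensor assigning each frame to a clip node.
    """
    num_clips = int(clip_index.max().item()) + 1
    dim = frame_feats.shape[1]
    # Aggregate frame nodes into coarser clip nodes (mean over member frames).
    clip_feats = torch.zeros(num_clips, dim).index_add_(0, clip_index, frame_feats)
    counts = torch.zeros(num_clips).index_add_(0, clip_index, torch.ones(len(clip_index)))
    clip_feats = clip_feats / counts.clamp(min=1).unsqueeze(1)
    # Send coarse clip-level context back down to every frame node.
    return frame_feats + clip_feats[clip_index]

frames = torch.randn(8, 16)
clips = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
updated = hierarchical_message_pass(frames, clips)
```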
{"title":"Hier-EgoPack: Hierarchical Egocentric Video Understanding With Diverse Task Perspectives","authors":"Simone Alberto Peirone;Francesca Pistilli;Antonio Alliegro;Tatiana Tommasi;Giuseppe Averta","doi":"10.1109/TPAMI.2025.3621326","DOIUrl":"10.1109/TPAMI.2025.3621326","url":null,"abstract":"Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage tasks synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by enabling reasoning also across diverse temporal granularities, which expands its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1917-1931"},"PeriodicalIF":18.6,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11202655","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145289306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models
Pub Date: 2025-10-14 | DOI: 10.1109/TPAMI.2025.3621246
Weixin Ye;Wei Wang;Yahui Liu;Yue Song;Bin Ren;Wei Bi;Rita Cucchiara;Nicu Sebe
In federated learning, the Transformer, a popular architecture, faces critical challenges in defending against gradient attacks and in improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of the Position Embeddings (PEs) in a Transformer contains sufficient information to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts by randomly shuffling tokens to break the token order, and then a learnable unknown (unk) position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that rely less on local spatial information. Notably, with careful use of MJP, we can not only improve models’ robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as image classification (e.g., ImageNet-1K) and sentiment analysis of text (e.g., Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks.
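A minimal sketch of the two steps the abstract describes, shuffling a random subset of tokens and masking their position embeddings with a shared learnable unk embedding; the masking ratio, initialization, and module layout are illustrative assumptions rather than the released MJP code.

```python
import torch
import torch.nn as nn

class MaskedJigsawPE(nn.Module):
    """Sketch: shuffle a random subset of tokens and replace their PEs with a learnable 'unk' PE."""

    def __init__(self, num_tokens: int, dim: int, mask_ratio: float = 0.25):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.unk_pos = nn.Parameter(torch.zeros(dim))   # shared 'unknown' position embedding
        self.mask_ratio = mask_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        b, n, d = tokens.shape
        k = max(1, int(self.mask_ratio * n))
        idx = torch.randperm(n)[:k]                      # tokens to jumble
        shuffled = idx[torch.randperm(k)]                # their new order
        tokens = tokens.clone()
        tokens[:, idx] = tokens[:, shuffled]             # break the token order
        pe = self.pos_embed.unsqueeze(0).expand(b, n, d).clone()
        pe[:, idx] = self.unk_pos                        # mask out PEs of the shuffled tokens
        return tokens + pe

x = torch.randn(2, 16, 32)
out = MaskedJigsawPE(num_tokens=16, dim=32)(x)
```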
{"title":"A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models","authors":"Weixin Ye;Wei Wang;Yahui Liu;Yue Song;Bin Ren;Wei Bi;Rita Cucchiara;Nicu Sebe","doi":"10.1109/TPAMI.2025.3621246","DOIUrl":"10.1109/TPAMI.2025.3621246","url":null,"abstract":"In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable <italic>unknown (unk)</i> position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information which is encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that are less reliant on the local spatial information. Notably, with the careful use of MJP, we can not only improve models’ robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as classification for images (e.g., ImageNet-1 K) and sentiment analysis for text (e.g., Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1873-1887"},"PeriodicalIF":18.6,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145288485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discourse-Aware Language Representation
Pub Date: 2025-10-14 | DOI: 10.1109/TPAMI.2025.3621229
Zhuosheng Zhang;Siru Ouyang;Hai Zhao
Recent Transformer-based language representation techniques have commonly adopted a straightforward approach to modeling textual context as a linear sequence of successive tokens. However, this sequential modeling strategy falls short in actively exploring intermediate structures present in natural languages and does not account for the rich interactive relationships between sentences. To overcome these limitations, we propose a discourse-aware framework that bridges the gap between sequential contextualization and the interactive nature of conversational reading comprehension. Concretely, we first divide the context into elementary discourse units (EDUs), ensuring that each unit contains precisely one condition. Then, we systematically explore three instantiations for modeling discourse features: sequential EDU encoding, discourse-aware masking, and discourse graph network. These techniques allow us to capture the nuanced interactions within the discourse. To assess the efficacy of our methodologies, we perform experiments on three conversational reading comprehension tasks: multi-turn response selection, conversational question answering, and conversational machine reading. Experimental results demonstrate the superiority of our proposed approach. Moreover, analysis reveals that the discourse-aware approach enables the model to effectively capture intricate relationships within the context and fosters reasoning interpretability. Additionally, our method exhibits efficacy across various backbone PLMs and diverse domains.
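To illustrate one of the three instantiations, discourse-aware masking, here is a hypothetical construction of an attention mask that lets tokens attend within their own elementary discourse unit (EDU) plus any explicitly linked EDUs; the linking rule and function name are assumptions for illustration, not the paper's exact scheme.

```python
import torch

def discourse_aware_mask(edu_ids: torch.Tensor, edu_links: set) -> torch.Tensor:
    """Boolean attention mask: True where attention is allowed.

    edu_ids:   (seq_len,) EDU index of each token.
    edu_links: symmetric pairs (a, b) of EDU indices whose tokens may attend to each other.
    """
    same_edu = edu_ids.unsqueeze(0) == edu_ids.unsqueeze(1)        # within-EDU attention
    mask = same_edu.clone()
    for a, b in edu_links:                                         # cross-EDU attention for linked units
        mask |= (edu_ids.unsqueeze(0) == a) & (edu_ids.unsqueeze(1) == b)
        mask |= (edu_ids.unsqueeze(0) == b) & (edu_ids.unsqueeze(1) == a)
    return mask

ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
mask = discourse_aware_mask(ids, {(0, 2)})   # tokens of EDU 0 and EDU 2 may also attend to each other
```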
{"title":"Discourse-Aware Language Representation","authors":"Zhuosheng Zhang;Siru Ouyang;Hai Zhao","doi":"10.1109/TPAMI.2025.3621229","DOIUrl":"10.1109/TPAMI.2025.3621229","url":null,"abstract":"Recent Transformer-based language representation techniques have commonly adopted a straightforward approach to modeling textual context as a linear sequence of successive tokens. However, this sequential modeling strategy falls short in actively exploring intermediate structures present in natural languages and does not account for the rich interactive relationships between sentences. To overcome these limitations, we propose a discourse-aware framework that bridges the gap between sequential contextualization and the interactive nature of conversational reading comprehension. Concretely, we first divide the context into elementary discourse units (EDUs), ensuring that each unit contains precisely one condition. Then, we systematically explore three instantiations for modeling discourse features: sequential EDU encoding, discourse-aware masking, and discourse graph network. These techniques allow us to capture the nuanced interactions within the discourse. To assess the efficacy of our methodologies, we perform experiments on three conversational reading comprehension tasks: multi-turn response selection, conversational question answering, and conversational machine reading. Experimental results demonstrate the superiority of our proposed approach. Moreover, analysis reveals that the discourse-aware approach enables the model to effectively capture intricate relationships within the context and fosters reasoning interpretability. Additionally, our method exhibits efficacy across various backbone PLMs and diverse domains.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1888-1903"},"PeriodicalIF":18.6,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145289473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IRNet: Iterative Refinement Network for Noisy Partial Label Learning
Pub Date: 2025-10-13 | DOI: 10.1109/TPAMI.2025.3620388
Zheng Lian;Mingyu Xu;Lan Chen;Licai Sun;Bin Liu;Lei Feng;Jianhua Tao
Partial label learning (PLL) is a typical weakly supervised learning paradigm in which each sample is associated with a set of candidate labels. Its basic assumption is that the ground-truth label lies in the candidate set, but this assumption may not hold due to unprofessional judgments by annotators. Therefore, we relax this assumption and focus on a more general task, noisy PLL, where the ground-truth label may not be in the candidate set. To address this challenging task, we propose a novel framework called the “Iterative Refinement Network (IRNet)”, which aims to purify noisy samples through two key modules (i.e., noisy sample detection and label correction). To achieve better performance, we exploit smoothness constraints to reduce prediction errors in these modules. Through theoretical analysis, we prove that IRNet is able to reduce the noise level of the dataset and eventually approximate the Bayes optimal classifier. Meanwhile, IRNet is a plug-in strategy that can be integrated with existing PLL approaches. Experimental results on multiple benchmark datasets show that IRNet outperforms state-of-the-art approaches on noisy PLL.
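As a rough, hypothetical reading of the two modules (the confidence-threshold detection rule and the correction rule below are illustrative assumptions, not IRNet's exact criteria or its smoothness constraints), a single purification step might look like:

```python
import numpy as np

def irnet_style_purify(probs: np.ndarray, candidate_masks: np.ndarray, tau: float = 0.9):
    """Toy purification step.

    probs:           (n_samples, n_classes) softmax predictions of the current model.
    candidate_masks: (n_samples, n_classes) boolean candidate-label sets.
    Returns updated candidate masks where detected noisy samples are relabeled with the prediction.
    """
    updated = candidate_masks.copy()
    pred = probs.argmax(axis=1)
    # Noisy-sample detection: the model is confident, yet its prediction lies outside the candidate set.
    confident = probs.max(axis=1) >= tau
    outside = ~candidate_masks[np.arange(len(pred)), pred]
    noisy = confident & outside
    # Label correction: replace the candidate set of detected samples with the predicted label.
    updated[noisy] = False
    updated[np.where(noisy)[0], pred[noisy]] = True
    return updated

probs = np.array([[0.05, 0.92, 0.03], [0.40, 0.30, 0.30]])
cands = np.array([[True, False, True], [True, True, False]])
print(irnet_style_purify(probs, cands))
```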
{"title":"IRNet: Iterative Refinement Network for Noisy Partial Label Learning","authors":"Zheng Lian;Mingyu Xu;Lan Chen;Licai Sun;Bin Liu;Lei Feng;Jianhua Tao","doi":"10.1109/TPAMI.2025.3620388","DOIUrl":"10.1109/TPAMI.2025.3620388","url":null,"abstract":"Partial label learning (PLL) is a typical weakly supervised learning, where each sample is associated with a set of candidate labels. Its basic assumption is that the ground-truth label must be in the candidate set, but this assumption may not be satisfied due to the unprofessional judgment of annotators. Therefore, we relax this assumption and focus on a more general task, noisy PLL, where the ground-truth label may not exist in the candidate set. To address this challenging task, we propose a novel framework called “Iterative Refinement Network (IRNet)”, aiming to purify noisy samples through two key modules (i.e., noisy sample detection and label correction). To achieve better performance, we exploit smoothness constraints to reduce prediction errors in these modules. Through theoretical analysis, we prove that IRNet is able to reduce the noise level of the dataset and eventually approximate the Bayes optimal classifier. Meanwhile, IRNet is a plug-in strategy that can be integrated with existing PLL approaches. Experimental results on multiple benchmark datasets show that IRNet outperforms state-of-the-art approaches on noisy PLL.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1932-1948"},"PeriodicalIF":18.6,"publicationDate":"2025-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145282990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Pub Date: 2025-10-10 | DOI: 10.1109/TPAMI.2025.3620139
Min Cao;Xinyu Zhou;Ding Jiang;Bo Du;Mang Ye;Min Zhang
Text-to-image person retrieval (TIPR) aims to identify a target person from textual descriptions and faces the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA, a Bidirectional Implicit Relation Reasoning and Aligning framework, to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked images and text, implicitly enhancing the modeling of local relations across languages and modalities, while a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets.
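A toy, runnable illustration of the bidirectional masked-prediction idea: mask one modality and reconstruct it conditioned on the other, in both directions. Everything here, the dimensions, the placeholder joint encoder, and the reconstruction loss, is an assumption for illustration and not Bi-IRRA's actual architecture.

```python
import torch
import torch.nn as nn

def mask_tokens(tokens: torch.Tensor, ratio: float = 0.3):
    """Zero out a random subset of token embeddings; return the masked copy and the positions."""
    n = tokens.shape[1]
    pos = torch.randperm(n)[: max(1, int(ratio * n))]
    masked = tokens.clone()
    masked[:, pos] = 0.0
    return masked, pos

dim = 32
# Placeholder joint encoder standing in for the cross-modal backbone.
joint = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

img, txt = torch.randn(2, 8, dim), torch.randn(2, 8, dim)

# Direction 1: reconstruct masked image tokens conditioned on the full text.
m_img, p_img = mask_tokens(img)
rec_img = joint(torch.cat([m_img, txt], dim=-1))
loss_img = nn.functional.mse_loss(rec_img[:, p_img], img[:, p_img])

# Direction 2: reconstruct masked text tokens conditioned on the full image.
m_txt, p_txt = mask_tokens(txt)
rec_txt = joint(torch.cat([img, m_txt], dim=-1))
loss_txt = nn.functional.mse_loss(rec_txt[:, p_txt], txt[:, p_txt])

loss = loss_img + loss_txt  # symmetric, bidirectional masked-prediction objective
```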
{"title":"Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning","authors":"Min Cao;Xinyu Zhou;Ding Jiang;Bo Du;Mang Ye;Min Zhang","doi":"10.1109/TPAMI.2025.3620139","DOIUrl":"10.1109/TPAMI.2025.3620139","url":null,"abstract":"Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing challenge in modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1961-1977"},"PeriodicalIF":18.6,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145260880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Probabilistically Aligned View-Unaligned Clustering With Adaptive Template Selection
Pub Date: 2025-10-07 | DOI: 10.1109/TPAMI.2025.3618984
Wenhua Dong;Xiao-Jun Wu;Zhenhua Feng;Sara Atito;Muhammad Awais;Josef Kittler
In most existing multi-view modeling scenarios, cross-view correspondence (CVC) between instances of the same target from different views, like paired image-text data, is a crucial prerequisite for effortlessly deriving a consistent representation. Nevertheless, this premise is frequently compromised in certain applications, where each view is organized and transmitted independently, resulting in the view-unaligned problem (VuP). Restoring the CVC of unaligned multi-view data is a challenging and highly demanding task that has received limited attention from the research community. To tackle this practical challenge, we propose to integrate the permutation derivation procedure into the bipartite graph paradigm for view-unaligned clustering, termed Probabilistically Aligned View-unaligned Clustering with Adaptive Template Selection (PAVuC-ATS). Specifically, we learn consistent anchors and view-specific graphs via the bipartite graph, and derive permutations applied to the unaligned graphs by reformulating the alignment between two latent representations as a 2-step transition of a Markov chain with adaptive template selection, thereby achieving probabilistic alignment. The convergence of the resultant optimization problem is validated both experimentally and theoretically. Extensive experiments on six benchmark datasets demonstrate the superiority of the proposed PAVuC-ATS over the baseline methods.
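One way to read the "2-step transition of a Markov chain" phrasing (an interpretation for illustration, not a formula taken from the paper) is that alignment between an unaligned representation $\mathbf{x}_i$ and a target latent representation $\mathbf{z}_j$ is routed through templates $\mathbf{t}_k$, following the usual two-step Chapman-Kolmogorov composition:

```latex
P(\mathbf{z}_j \mid \mathbf{x}_i) \;=\; \sum_{k} P(\mathbf{z}_j \mid \mathbf{t}_k)\, P(\mathbf{t}_k \mid \mathbf{x}_i)
```

Under this reading, the soft permutation is a product of two row-stochastic matrices, with the intermediate templates chosen adaptively.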
{"title":"Probabilistically Aligned View-Unaligned Clustering With Adaptive Template Selection","authors":"Wenhua Dong;Xiao-Jun Wu;Zhenhua Feng;Sara Atito;Muhammad Awais;Josef Kittler","doi":"10.1109/TPAMI.2025.3618984","DOIUrl":"10.1109/TPAMI.2025.3618984","url":null,"abstract":"In most existing multi-view modeling scenarios, cross-view correspondence (CVC) between instances of the same target from different views, like paired image-text data, is a crucial prerequisite for effortlessly deriving a consistent representation. Nevertheless, this premise is frequently compromised in certain applications, where each view is organized and transmitted independently, resulting in the view-unaligned problem (VuP). Restoring CVC of unaligned multi-view data is a challenging and highly demanding task that has received limited attention from the research community. To tackle this practical challenge, we propose to integrate the permutation derivation procedure into the bipartite graph paradigm for view-unaligned clustering, termed Probabilistically Aligned View-unaligned Clustering with Adaptive Template Selection (PAVuC-ATS). Specifically, we learn consistent anchors and view-specific graphs by the bipartite graph, and derive permutations applied to the unaligned graphs by reformulating the alignment between two latent representations as a 2-step transition of a Markov chain with adaptive template selection, thereby achieving the probabilistic alignment. The convergence of the resultant optimization problem is validated both experimentally and theoretically. Extensive experiments on six benchmark datasets demonstrate the superiority of the proposed PAVuC-ATS over the baseline methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1904-1916"},"PeriodicalIF":18.6,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145241237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Neural Network Parameter Selection via Dataset Similarity Under Meta-Learning Framework
Pub Date: 2025-10-07 | DOI: 10.1109/TPAMI.2025.3618991
Liping Deng;Maziar Raissi;MingQing Xiao
Optimizing the performance of deep neural networks (DNNs) remains a significant challenge due to the sensitivity of models to both hyperparameter selection and weight initialization. Existing approaches typically address these two factors independently, which often limits adaptability and overall effectiveness. In this paper, we present a novel meta-learning framework that jointly recommends hyperparameters and initial weights by leveraging dataset similarity. Our method begins by extracting meta-features from a collection of historical datasets. For a given query dataset, similarity is computed based on distances in the meta-feature space, and the most similar historical datasets are used to recommend the underlying parameter configurations. To capture the diverse characteristics of image datasets, we introduce two complementary types of meta-features. The first, referred to as shallow or visible meta-features, comprises five groups of statistical measures that summarize color and texture information. The second, termed deep or invisible meta-features, consists of 512 descriptors extracted from a convolutional neural network pre-trained on ImageNet. We evaluate our framework on 105 real-world image classification tasks, using 75 datasets for historical modeling and 30 for querying. Experimental results with both vision transformers and convolutional neural networks demonstrate that our approach consistently outperforms state-of-the-art baselines, underscoring the effectiveness of dataset-driven parameter recommendation in deep learning.
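The recommendation step reduces to nearest-neighbour search in meta-feature space; below is a minimal sketch under stated assumptions (Euclidean distance, top-k retrieval, and the function name are illustrative choices, not the paper's exact procedure).

```python
import numpy as np

def recommend_configs(query_meta: np.ndarray, hist_meta: np.ndarray, hist_configs: list, k: int = 3):
    """Return the stored configurations (hyperparameters + initial weights) of the k most
    similar historical datasets, ranked by Euclidean distance in meta-feature space."""
    dists = np.linalg.norm(hist_meta - query_meta, axis=1)
    nearest = np.argsort(dists)[:k]
    return [hist_configs[i] for i in nearest]

# Toy usage: 5 historical datasets, each described by 4 meta-features.
hist_meta = np.random.rand(5, 4)
hist_configs = [{"lr": 1e-3, "init": f"weights_{i}.pt"} for i in range(5)]  # hypothetical entries
query = np.random.rand(4)
print(recommend_configs(query, hist_meta, hist_configs))
```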
{"title":"Deep Neural Network Parameter Selection via Dataset Similarity Under Meta-Learning Framework","authors":"Liping Deng;Maziar Raissi;MingQing Xiao","doi":"10.1109/TPAMI.2025.3618991","DOIUrl":"10.1109/TPAMI.2025.3618991","url":null,"abstract":"Optimizing the performance of deep neural networks (DNNs) remains a significant challenge due to the sensitivity of models to both hyperparameter selection and weight initialization. Existing approaches typically address these two factors independently, which often leads to limiting adaptability and overall effectiveness. In this paper, we present a novel meta-learning framework that jointly recommends hyperparameters and initial weights by leveraging dataset similarity. Our method begins by extracting meta-features from a collection of historical datasets. For a given query dataset, similarity is computed based on distances in the meta-feature space, and the most similar historical datasets are used to recommend the underlying parameter configurations. To capture the diverse characteristics of image datasets, we introduce two complementary types of meta-features. The first, referred to as shallow or visible meta-features, comprises five groups of statistical measures that summarize color and texture information. The second, termed deep or invisible meta-features, consists of 512 descriptors extracted from a convolutional neural network pre-trained on ImageNet. We evaluated our framework in 105 real-world image classification tasks, using 75 datasets for historical modeling and 30 for querying. Experimental results with both vision transformers and convolutional neural networks demonstrate that our approach consistently outperforms state-of-the-art baselines, underscoring the effectiveness of dataset-driven parameter recommendation in deep learning.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"1860-1872"},"PeriodicalIF":18.6,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145240879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Dynamic Graph Embeddings With Neural Controlled Differential Equations
Pub Date: 2025-10-03 | DOI: 10.1109/TPAMI.2025.3617660
Tiexin Qin;Benjamin Walker;Terry Lyons;Hong Yan;Haoliang Li
This paper focuses on representation learning for dynamic graphs with temporal interactions. A fundamental issue is that both the graph structure and the nodes have their own dynamics, and their blending induces intractable complexity in the temporal evolution over graphs. Drawing inspiration from the recent progress of physical dynamic models in deep neural networks, we propose Graph Neural Controlled Differential Equations (GN-CDEs), a continuous-time framework that jointly models node embeddings and structural dynamics by incorporating a graph-enhanced neural network vector field with a time-varying graph path as the control signal. Our framework exhibits several desirable characteristics, including the ability to express dynamics on evolving graphs without piecewise integration, the capability to calibrate trajectories with subsequent data, and robustness to missing observations. Empirical evaluation on a range of dynamic graph representation learning tasks demonstrates the effectiveness of our proposed approach in capturing the complex dynamics of dynamic graphs.
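For context, a neural controlled differential equation evolves a latent state under a control path; the generic form below is the standard neural-CDE formulation (not necessarily the paper's exact notation), where in GN-CDEs the control path $X$ would be derived from the time-varying graph and $f_\theta$ is the graph-enhanced vector field:

```latex
\mathbf{z}_t \;=\; \mathbf{z}_{t_0} + \int_{t_0}^{t} f_\theta(\mathbf{z}_s)\, \mathrm{d}X_s
```

Because the integral is driven by the control path rather than by time alone, the trajectory can be recomputed or calibrated as new observations of $X$ arrive, which matches the calibration and missing-observation robustness the abstract highlights.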
{"title":"Learning Dynamic Graph Embeddings With Neural Controlled Differential Equations","authors":"Tiexin Qin;Benjamin Walker;Terry Lyons;Hong Yan;Haoliang Li","doi":"10.1109/TPAMI.2025.3617660","DOIUrl":"10.1109/TPAMI.2025.3617660","url":null,"abstract":"This paper focuses on representation learning for dynamic graphs with temporal interactions. A fundamental issue is that both the graph structure and the nodes own their own dynamics, and their blending induces intractable complexity in the temporal evolution over graphs. Drawing inspiration from the recent progress of physical dynamic models in deep neural networks, we propose <italic>Graph Neural Controlled Differential Equations</i> (GN-CDEs), a continuous-time framework that jointly models node embeddings and structural dynamics by incorporating a graph enhanced neural network vector field with a time-varying graph path as the control signal. Our framework exhibits several desirable characteristics, including the ability to express dynamics on evolving graphs without piecewise integration, the capability to calibrate trajectories with subsequent data, and robustness to missing observations. Empirical evaluation on a range of dynamic graph representation learning tasks demonstrates the effectiveness of our proposed approach in capturing the complex dynamics of dynamic graphs.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 2","pages":"2096-2103"},"PeriodicalIF":18.6,"publicationDate":"2025-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145215644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}