Pub Date: 2024-08-21 DOI: 10.1109/TPAMI.2024.3447085
Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi
Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models in resource-constrained environments and to accelerate inference, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. More than three thousand pruning papers were published from 2020 to 2024. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey, we provide a comprehensive review of existing research on deep neural network pruning, organized into a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning with other compression techniques. We then provide a thorough comparative analysis of eight pairs of contrast settings for pruning (e.g., unstructured/structured, one-shot/iterative, data-free/data-driven, initialized/pre-trained weights, etc.) and explore several emerging topics, including pruning for large language models, vision transformers, diffusion models, and large multimodal models, post-training pruning, and different levels of supervision for pruning, to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. Finally, we provide recommendations on selecting pruning methods and outline several promising research directions for neural network pruning. To facilitate future research on deep neural network pruning, we summarize broad pruning applications (e.g., adversarial robustness, natural language understanding, etc.) and build a curated collection of datasets, networks, and evaluations on different applications. We maintain a repository at https://github.com/hrcheng1066/awesome-pruning that serves as a comprehensive resource for neural network pruning papers and corresponding open-source code. We will keep updating this repository to include the latest advancements in the field.
Title: A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence.
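The unstructured/structured contrast named in the pruning survey abstract above can be made concrete with a small sketch. This is not taken from the survey: it uses plain magnitude-based criteria as an assumed, commonly used baseline, and the function names `unstructured_prune` and `structured_prune` are hypothetical.

```python
def unstructured_prune(weights, sparsity):
    """Zero out individual weights with the smallest magnitudes.

    Assumption: simple global magnitude criterion. Ties at the threshold
    are all pruned, so slightly more than `sparsity` may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # largest magnitude that still gets pruned
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune(weights, n_remove):
    """Remove whole rows (e.g. output neurons) with the smallest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in weights]
    order = sorted(range(len(weights)), key=lambda i: norms[i])
    drop = set(order[:n_remove])  # indices of the weakest rows
    return [row[:] for i, row in enumerate(weights) if i not in drop]


W = [[0.9, -0.05, 0.4],
     [0.01, 0.02, -0.03],
     [-0.7, 0.6, 0.1]]

masked = unstructured_prune(W, 0.5)   # same shape, scattered zeros
smaller = structured_prune(W, 1)      # one whole row removed
```

The unstructured result keeps the matrix shape and only introduces zeros, so it typically needs sparse kernels or hardware support to run faster. The structured result is a genuinely smaller dense matrix.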
Pub Date: 2024-08-21 DOI: 10.1109/TPAMI.2024.3447287
Feiping Nie, Chaodie Liu, Rong Wang, Xuelong Li
Spectral clustering has been attracting increasing attention due to its well-defined framework and excellent performance. However, most traditional spectral clustering methods consist of two separate steps: 1) solving a relaxed optimization problem to learn the continuous clustering labels, and 2) rounding the continuous clustering labels into discrete ones. This relax-and-discretize strategy inevitably results in information loss and unsatisfactory clustering performance. Moreover, the similarity matrix constructed from the original data may not be optimal for clustering, since data usually contain noise and redundancy. To address these problems, we propose a novel and effective algorithm, called Direct Spectral Clustering (DSC), that directly optimizes the original spectral clustering model. We theoretically prove that the original spectral clustering model can be solved by simultaneously learning a weighted discrete indicator matrix and a structured similarity matrix whose number of connected components equals the number of clusters. Either of them can be used to directly obtain the final clustering results without any post-processing. Further, an effective iterative optimization algorithm is developed to solve the proposed model. Extensive experiments performed on synthetic and real-world datasets demonstrate the superiority and effectiveness of the proposed method compared to state-of-the-art algorithms.
Title: A Novel and Effective Method to Directly Solve Spectral Clustering. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence.
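The spectral clustering abstract above states that a structured similarity matrix with as many connected components as clusters yields the clustering without post-processing. The sketch below illustrates only that final readout step, assuming such a matrix has already been learned. It does not implement the DSC optimization, and the function name `labels_from_components` is hypothetical.

```python
from collections import deque


def labels_from_components(S, tol=0.0):
    """Assign one cluster label per connected component of the graph
    whose edges are the entries of S larger than `tol`."""
    n = len(S)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already reached from an earlier node
        labels[start] = current
        queue = deque([start])
        while queue:  # breadth-first search over one component
            i = queue.popleft()
            for j in range(n):
                if labels[j] == -1 and (S[i][j] > tol or S[j][i] > tol):
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels


# Block-structured similarity: nodes {0, 1, 2} and {3, 4} are not linked.
S = [[0.0, 0.8, 0.0, 0.0, 0.0],
     [0.8, 0.0, 0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.9],
     [0.0, 0.0, 0.0, 0.9, 0.0]]

cluster_labels = labels_from_components(S)
```

Because the graph has exactly two components, the labels come straight from the graph structure, with no k-means or rounding step.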
Pub Date: 2024-08-21 DOI: 10.1109/TPAMI.2024.3447008
Tao Xie, Kun Dai, Qihao Sun, Zhiqiang Jiang, Chuqing Cao, Lijun Zhao, Ke Wang, Ruifeng Li
We present CO-Net++, a cohesive framework that optimizes multiple point cloud tasks collectively across heterogeneous dataset domains with a two-stage feature rectification strategy. The core of CO-Net++ lies in optimizing task-shared parameters to capture universal features across various tasks while discerning task-specific parameters tailored to encapsulate the unique characteristics of each task. Specifically, CO-Net++ develops a two-stage feature rectification strategy (TFRS) that distinctly separates the optimization processes for task-shared and task-specific parameters. In the first stage, TFRS configures all parameters in the backbone as task-shared, which encourages CO-Net++ to thoroughly assimilate universal attributes pertinent to all tasks. In addition, TFRS introduces a sign-based gradient surgery to facilitate the optimization of task-shared parameters, thus alleviating conflicting gradients induced by various dataset domains. In the second stage, TFRS freezes task-shared parameters and flexibly integrates task-specific parameters into the network to encode the specific characteristics of each dataset domain. CO-Net++ markedly mitigates conflicting optimization caused by parameter entanglement, ensuring sufficient identification of universal and specific features. Extensive experiments reveal that CO-Net++ achieves exceptional performance on both 3D object detection and 3D semantic segmentation tasks. Moreover, CO-Net++ delivers an impressive incremental learning capability and prevents catastrophic amnesia when generalizing to new point cloud tasks.
Title: CO-Net++: A Cohesive Network for Multiple Point Cloud Tasks at Once with Two-Stage Feature Rectification. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence.
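The CO-Net++ abstract above mentions a sign-based gradient surgery for task-shared parameters but does not spell out the rule. The sketch below shows one plausible reading, purely as an assumption: per coordinate, gradients from different tasks are averaged only when their signs agree and are zeroed otherwise. The paper's actual procedure may differ, and the name `sign_agreement_merge` is hypothetical.

```python
def sign_agreement_merge(task_grads):
    """Merge per-task gradients of shared parameters.

    Assumed rule (not confirmed by the abstract): keep a coordinate only
    if all nonzero task gradients share the same sign, else set it to 0.
    """
    merged = []
    for coords in zip(*task_grads):  # one tuple per parameter coordinate
        nonzero = [g for g in coords if g != 0.0]
        agree = nonzero and (all(g > 0 for g in nonzero)
                             or all(g < 0 for g in nonzero))
        merged.append(sum(coords) / len(coords) if agree else 0.0)
    return merged


# Two tasks, three shared parameters; the second coordinate conflicts.
grads = [[0.5, -1.0, 0.25],
         [1.5, 0.5, 0.75]]
shared_update = sign_agreement_merge(grads)
```

Under this rule the conflicting coordinate receives no update, so neither task pulls the shared parameter against the other.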
Pub Date: 2024-08-21 DOI: 10.1109/TPAMI.2024.3445770
Zicheng Zhang, Haoning Wu, Erli Zhang, Guangtao Zhai, Weisi Lin
The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs on low-level visual perception and understanding remains largely unexplored. To this end, we design benchmark settings to emulate human language responses related to low-level vision: low-level visual perception (A1), via visual question answering related to low-level attributes (e.g. clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we extend the low-level perception-related question answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e. predicting a score, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (like humans). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs. Datasets will be available at https://github.com/Q-Future/Q-Bench.
Title: Q-BENCH: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence.
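The Q-BENCH abstract above describes a softmax-based approach for turning MLLM outputs into quantifiable quality ratings. A minimal sketch of such a scheme follows, assuming the model exposes logits for two opposing answer tokens. The choice of tokens (for example "good" versus "poor") and the function name `softmax_quality_score` are illustrative assumptions, not details stated in the abstract.

```python
import math


def softmax_quality_score(logit_positive, logit_negative):
    """Return the softmax probability of the positive token, in [0, 1].

    Assumption: the rating is a two-way softmax over the logits of one
    positive and one negative quality word at the answer position.
    """
    m = max(logit_positive, logit_negative)  # subtract max for stability
    e_pos = math.exp(logit_positive - m)
    e_neg = math.exp(logit_negative - m)
    return e_pos / (e_pos + e_neg)


# Hypothetical logits for three images.
scores = [softmax_quality_score(p, n)
          for p, n in [(2.0, 0.0), (0.0, 0.0), (-1.0, 3.0)]]
```

The resulting continuous scores can then be correlated with human opinion scores, which a hard "good"/"poor" text answer would not allow.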
Pub Date: 2024-08-20 DOI: 10.1109/TPAMI.2024.3446537
Wei Chang, Huimin Chen, Feiping Nie, Rong Wang, Xuelong Li
Multi-view learning has attracted increasing attention in recent years. However, traditional approaches focus only on the differences among views while ignoring their consistency. This can render views affected by abnormal data or noise ineffective during view learning. Moreover, datasets have gradually become high-dimensional and large-scale. Therefore, this paper proposes a novel multi-view compressed subspace learning method via a low-rank tensor constraint, which incorporates the clustering process and multi-view learning into a unified framework. First, for each view, we take a subset of the samples to build a small dictionary, which greatly reduces both the effect of redundant information and the computational cost. Then, to find the consistency and differences among views, we impose a low-rank tensor constraint on these representations and further design an auto-weighted mechanism to learn the optimal representation. Last, because the learned representation is non-square, we introduce a bipartite graph, and under the structured constraint, the clustering results can be obtained directly from this graph without any post-processing. Extensive experiments on synthetic and real-world benchmark datasets demonstrate the efficacy and efficiency of our method, especially for views with noise or outliers.
Title: Tensorized and Compressed Multi-view Subspace Clustering via Structured Constraint. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence.
Many machine learning problems can be formulated as non-convex multi-player games. Due to non-convexity, it is challenging to obtain the existence condition of the global Nash equilibrium (NE) and to design theoretically guaranteed algorithms. This paper studies a class of non-convex multi-player games, where players' payoff functions consist of canonical functions and quadratic operators. We leverage conjugate properties to transform the complementary problem into a variational inequality (VI) problem using a continuous pseudo-gradient mapping. We prove an existence condition for the global NE, namely that the solution to the VI problem satisfies a duality relation. We then design an ordinary differential equation to approach the global NE with an exponential convergence rate. For practical implementation, we derive a discretized algorithm and apply it to two scenarios: multi-player games with generalized monotonicity and multi-player potential games. In the two settings, step sizes are required to be O(1/k) and O(1/√k) to yield convergence rates of O(1/k) and O(1/√k), respectively. Extensive experiments on robust neural network training and sensor network localization validate our theory. Our code is available at https://github.com/GuanpuChen/Global-NE.
Title: Approaching the Global Nash Equilibrium of Non-convex Multi-player Games. Authors: Guanpu Chen, Gehui Xu, Fengxiang He, Yiguang Hong, Leszek Rutkowski, Dacheng Tao. DOI: 10.1109/TPAMI.2024.3445666. Pub Date: 2024-08-19. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence.
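The Nash equilibrium abstract above pairs O(1/k) step sizes with an O(1/k) convergence rate for a discretized pseudo-gradient algorithm. The sketch below illustrates that pairing on a deliberately simple toy game. The two-player quadratic game, the step rule 1/(k+1), and the function names are all assumptions chosen for illustration. The game is strongly monotone, not a member of the non-convex class studied in the paper, and the code is unrelated to the authors' repository.

```python
def pseudo_gradient(x, y):
    """Stack each player's partial gradient.

    Toy game (assumed): player 1 minimizes x^2 - x*y - x over x,
    player 2 minimizes y^2 - x*y - y over y. The unique NE is (1, 1).
    """
    return 2 * x - y - 1, 2 * y - x - 1


def play(num_steps, x=0.0, y=0.0):
    """Simultaneous pseudo-gradient steps with a diminishing step size."""
    for k in range(1, num_steps + 1):
        step = 1.0 / (k + 1)  # O(1/k) step size
        gx, gy = pseudo_gradient(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


x_99, y_99 = play(99)
error_99 = abs(x_99 - 1.0)      # distance to the NE after 99 steps
x_999, y_999 = play(999)
error_999 = abs(x_999 - 1.0)    # roughly ten times smaller
```

Starting from the origin, the distance to the equilibrium in this toy game shrinks by the factor k/(k+1) at step k, so it equals 1/(n+1) after n steps, which is the O(1/k) behaviour the step-size rule is meant to produce.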
Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
Title: A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective. Authors: Chaoqi Chen, Yushuang Wu, Qiyuan Dai, Hong-Yu Zhou, Mutian Xu, Sibei Yang, Xiaoguang Han, Yizhou Yu. DOI: 10.1109/TPAMI.2024.3445463. Pub Date: 2024-08-19. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence.
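The GNN survey abstract above refers to local neighborhood aggregation and its limitations. A minimal sketch of one aggregation layer follows, assuming mean aggregation with self-loops and no learned weights. The function name `mean_aggregate` is hypothetical, and real GNN layers add trainable transforms and nonlinearities.

```python
def mean_aggregate(features, edges):
    """One round of message passing: each node's new feature is the mean
    of its own feature and those of its direct neighbors."""
    n = len(features)
    neighbors = {i: [i] for i in range(n)}  # self-loop keeps own feature
    for u, v in edges:                      # undirected edges
        neighbors[u].append(v)
        neighbors[v].append(u)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in neighbors[i]) / len(neighbors[i])
         for d in range(dim)]
        for i in range(n)
    ]


# Path graph 0 - 1 - 2 with scalar node features.
feats = [[1.0], [2.0], [6.0]]
edges = [(0, 1), (1, 2)]
one_hop = mean_aggregate(feats, edges)
```

After one layer, node 0 has no information about node 2. Reaching distant nodes requires stacking layers, which is the locality limitation that graph Transformers address with attention over all nodes.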
Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformers can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, geometric and semantic constraints are applied to the part representations through the intermediate alignment results for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention. The code will be released upon paper acceptance.
Title: Unsupervised Part Discovery via Dual Representation Alignment. Authors: Jiahao Xia, Wenjian Huang, Min Xu, Jianguo Zhang, Haimin Zhang, Ziyu Sheng, Dong Xu. DOI: 10.1109/TPAMI.2024.3445582. Pub Date: 2024-08-19. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence.
Unsupervised Domain Adaptation (UDA) methods have been successful in reducing label dependency by minimizing the domain discrepancy between labeled source domains and unlabeled target domains. However, these methods face challenges when dealing with Multivariate Time-Series (MTS) data. MTS data typically originates from multiple sensors, each with its unique distribution. This property poses difficulties in adapting existing UDA techniques, which mainly focus on aligning global features while overlooking the distribution discrepancies at the sensor level, thus limiting their effectiveness for MTS data. To address this issue, a practical domain adaptation scenario is formulated as Multivariate Time-Series Unsupervised Domain Adaptation (MTS-UDA). In this paper, we propose SEnsor Alignment (SEA) for MTS-UDA, aiming to address domain discrepancy at both local and global sensor levels. At the local sensor level, we design endo-feature alignment, which aligns sensor features and their correlations across domains. To reduce domain discrepancy at the global sensor level, we design exo-feature alignment that enforces restrictions on global sensor features. We further extend SEA to SEA++ by enhancing the endo-feature alignment. Particularly, we incorporate multi-graph-based higher-order alignment for both sensor features and their correlations. Extensive empirical results have demonstrated the state-of-the-art performance of our SEA and SEA++ on six public MTS datasets for MTS-UDA.
{"title":"SEA++: Multi-Graph-based Higher-Order Sensor Alignment for Multivariate Time-Series Unsupervised Domain Adaptation.","authors":"Yucheng Wang, Yuecong Xu, Jianfei Yang, Min Wu, Xiaoli Li, Lihua Xie, Zhenghua Chen","doi":"10.1109/TPAMI.2024.3444904","DOIUrl":"10.1109/TPAMI.2024.3444904","url":null,"PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a conceptually novel, flexible, and effective framework (named T-Net++) for the task of two-view correspondence pruning. T-Net++ comprises two unique structures: the "-" structure and the "|" structure. The "-" structure utilizes an iterative learning strategy to process correspondences, while the "|" structure integrates all feature information of the "-" structure and produces inlier weights. Moreover, within the "|" structure, we design a new Local-Global Attention Fusion module to fully exploit valuable information obtained from concatenating features through channel-wise and spatial-wise relationships. Furthermore, we develop a Channel-Spatial Squeeze-and-Excitation module, a modified network backbone that enhances the representation ability of important channels and correspondences through the squeeze-and-excitation operation. T-Net++ not only preserves permutation equivariance for correspondence pruning, but also gathers rich contextual information, thereby enhancing the effectiveness of the network. Experimental results demonstrate that T-Net++ outperforms other state-of-the-art correspondence pruning methods on various benchmarks and excels in two extended tasks. Our code will be available at https://github.com/guobaoxiao/T-Net.
{"title":"T-Net++: Effective Permutation-Equivariance Network for Two-View Correspondence Pruning.","authors":"Guobao Xiao, Xin Liu, Zhen Zhong, Xiaoqin Zhang, Jiayi Ma, Haibin Ling","doi":"10.1109/TPAMI.2024.3444457","DOIUrl":"10.1109/TPAMI.2024.3444457","url":null,"PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}