Semantic-aware Message Broadcasting for Efficient Unsupervised Domain Adaptation
Pub Date: 2024-08-08 DOI: 10.1109/TIP.2024.3437212
Xin Li, Cuiling Lan, Guoqiang Wei, Zhibo Chen
Vision transformers have demonstrated great potential in a wide range of vision tasks. However, they inevitably suffer from poor generalization when a distribution shift occurs at test time (i.e., on out-of-distribution data). To mitigate this issue, we propose a novel method, Semantic-aware Message Broadcasting (SAMB), which enables more informative and flexible feature alignment for unsupervised domain adaptation (UDA). In particular, we study the attention module of the vision transformer and observe that the alignment space built on a single global class token lacks flexibility: the class token exchanges information with all image tokens in the same manner and ignores the rich semantics of different regions. In this paper, we aim to enrich the alignment features by enabling semantic-aware adaptive message broadcasting. Specifically, we introduce a group of learned group tokens as nodes that aggregate global information from all image tokens, while encouraging different group tokens to adaptively focus their message broadcasting on different semantic regions. In this way, our message broadcasting encourages the group tokens to learn more informative and diverse representations for effective domain alignment. Moreover, we systematically study the effects of adversarial-based feature alignment (ADA) and pseudo-label-based self-training (PST) on UDA. We find that a simple two-stage training strategy combining ADA and PST can further improve the adaptation capability of the vision transformer. Extensive experiments on DomainNet, OfficeHome, and VisDA-2017 demonstrate the effectiveness of our method for UDA.
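To make the group-token mechanism concrete, below is a minimal PyTorch sketch of semantic-aware message broadcasting: learned group tokens first aggregate information from all image tokens, then broadcast it back through a second attention pass so that each image region draws from the group tokens most relevant to it. The module name, the two-attention decomposition, and all parameters are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SemanticMessageBroadcast(nn.Module):
    """Illustrative sketch: learned group tokens aggregate global context
    from all image tokens, then broadcast messages back so that each image
    region attends to the group tokens most relevant to its semantics."""

    def __init__(self, dim: int, num_groups: int = 8, num_heads: int = 8):
        super().__init__()
        # One learnable token per semantic group (dim must be divisible by num_heads).
        self.group_tokens = nn.Parameter(torch.randn(1, num_groups, dim) * 0.02)
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, num_tokens, dim)
        groups = self.group_tokens.expand(image_tokens.size(0), -1, -1)
        # 1) Aggregation: group tokens query all image tokens.
        groups, _ = self.aggregate(groups, image_tokens, image_tokens)
        # 2) Adaptive broadcasting: each image token queries the group
        #    tokens, so different regions receive different messages.
        messages, _ = self.broadcast(image_tokens, groups, groups)
        return image_tokens + messages  # residual update of the image tokens
```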
{"title":"Semantic-aware Message Broadcasting for Efficient Unsupervised Domain Adaptation.","authors":"Xin Li, Cuiling Lan, Guoqiang Wei, Zhibo Chen","doi":"10.1109/TIP.2024.3437212","DOIUrl":"https://doi.org/10.1109/TIP.2024.3437212","url":null,"abstract":"<p><p>Vision transformer has demonstrated great potential in abundant vision tasks. However, it also inevitably suffers from poor generalization capability when the distribution shift occurs in testing (i.e., out-of-distribution data). To mitigate this issue, we propose a novel method, Semantic-aware Message Broadcasting (SAMB), which enables more informative and flexible feature alignment for unsupervised domain adaptation (UDA). Particularly, we study the attention module in the vision transformer and notice that the alignment space using one global class token lacks enough flexibility, where it interacts information with all image tokens in the same manner but ignores the rich semantics of different regions. In this paper, we aim to improve the richness of the alignment features by enabling semantic-aware adaptive message broadcasting. Particularly, we introduce a group of learned group tokens as nodes to aggregate the global information from all image tokens, but encourage different group tokens to adaptively focus on the message broadcasting to different semantic regions. In this way, our message broadcasting encourages the group tokens to learn more informative and diverse information for effective domain alignment. Moreover, we systematically study the effects of adversarial-based feature alignment (ADA) and pseudo-label based self-training (PST) on UDA. We find that one simple two-stage training strategy with the cooperation of ADA and PST can further improve the adaptation capability of the vision transformer. Extensive experiments on DomainNet, OfficeHome, and VisDA-2017 demonstrate the effectiveness of our methods for UDA.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141908632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-view clustering aims to learn discriminative representations from multi-view data. Although existing methods achieve impressive performance by leveraging contrastive learning to bridge the representation gap between every pair of views, they share a common limitation: they do not perform semantic alignment from a global perspective, which undermines the semantic patterns in multi-view data. This paper presents CSOT (Common Semantics via Optimal Transport), which boosts contrastive multi-view clustering via semantic learning in a common space that integrates all views. Through optimal transport, samples from multiple views are mapped to joint clusters that represent the multi-view semantic patterns in the common space. With the semantic assignment derived from the optimal transport plan, we design a semantic learning module in which the soft assignment vector serves as a global supervision signal enforcing consistent semantics across all views. Moreover, we propose a semantic-aware re-weighting strategy that treats samples differently according to their semantic significance, improving the effectiveness of cross-view contrastive representation learning. Extensive experimental results demonstrate that CSOT achieves state-of-the-art clustering performance.
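For intuition, the sketch below shows one common way to derive such an optimal-transport semantic assignment: a Sinkhorn-Knopp iteration that turns sample-to-cluster similarity scores into a soft assignment with roughly uniform cluster usage. CSOT's exact transport formulation may differ; this is a generic recipe with names of our own choosing.

```python
import torch

def sinkhorn_assignment(scores: torch.Tensor, eps: float = 0.05,
                        iters: int = 3) -> torch.Tensor:
    """Generic Sinkhorn sketch: map sample-to-cluster similarity scores
    (B x K) to a soft assignment whose marginals are (approximately)
    uniform over the K joint clusters."""
    Q = torch.exp(scores / eps)  # entropic kernel of the similarities
    Q = Q / Q.sum()              # normalize to a joint distribution
    B, K = Q.shape
    for _ in range(iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / K  # columns sum to 1/K (uniform clusters)
        Q = Q / Q.sum(dim=1, keepdim=True) / B  # rows sum to 1/B (one unit per sample)
    return Q * B  # each row is now a soft assignment over the K clusters
```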
{"title":"Learning Common Semantics via Optimal Transport for Contrastive Multi-View Clustering","authors":"Qian Zhang;Lin Zhang;Ran Song;Runmin Cong;Yonghuai Liu;Wei Zhang","doi":"10.1109/TIP.2024.3436615","DOIUrl":"10.1109/TIP.2024.3436615","url":null,"abstract":"Multi-view clustering aims to learn discriminative representations from multi-view data. Although existing methods show impressive performance by leveraging contrastive learning to tackle the representation gap between every two views, they share the common limitation of not performing semantic alignment from a global perspective, resulting in the undermining of semantic patterns in multi-view data. This paper presents CSOT, namely Common Semantics via Optimal Transport, to boost contrastive multi-view clustering via semantic learning in a common space that integrates all views. Through optimal transport, the samples in multiple views are mapped to the joint clusters which represent the multi-view semantic patterns in the common space. With the semantic assignment derived from the optimal transport plan, we design a semantic learning module where the soft assignment vector works as a global supervision to enforce the model to learn consistent semantics among all views. Moreover, we propose a semantic-aware re-weighting strategy to treat samples differently according to their semantic significance, which improves the effectiveness of cross-view contrastive representation learning. Extensive experimental results demonstrate that CSOT achieves the state-of-the-art clustering performance.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141908631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unfolded Proximal Neural Networks for Robust Image Gaussian Denoising
Pub Date: 2024-08-07 DOI: 10.1109/TIP.2024.3437219
Hoang Trieu Vy Le;Audrey Repetti;Nelly Pustelnik
A common approach to solving inverse imaging problems relies on finding a maximum a posteriori (MAP) estimate of the original unknown image by solving a minimization problem. In this context, iterative proximal algorithms are widely used, as they can handle non-smooth functions and linear operators. Recently, these algorithms have been paired with deep learning strategies to further improve the estimate quality. In particular, proximal neural networks (PNNs) have been introduced, obtained by unrolling a proximal algorithm for finding a MAP estimate over a fixed number of iterations, with learned linear operators and parameters. Because PNNs are grounded in optimization theory, they are very flexible and can be adapted to any image restoration task that a proximal algorithm can solve. They also have much lighter architectures than traditional networks. In this article, we propose a unified framework for building PNNs for the Gaussian denoising task, based on both the dual forward-backward (dual-FB) and the primal-dual Chambolle-Pock algorithms. We further show that accelerated inertial versions of these algorithms yield skip connections in the associated network layers. We propose different learning strategies for our PNN framework and investigate their robustness (Lipschitz property) and denoising efficiency. Finally, we assess the robustness of our PNNs when plugged into a forward-backward algorithm for an image deblurring problem.
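As a rough illustration of the unrolling idea, the sketch below unfolds a dual forward-backward scheme for the denoising problem min_x (1/2)||x - y||^2 + lam*||Lx||_1, replacing the analysis operator L with a learned convolution per layer. Layer sizes, the step-size parameterization, and all names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnrolledDualFB(nn.Module):
    """Sketch of a proximal neural network: K iterations of a dual
    forward-backward scheme for l1-analysis denoising are unrolled into
    K layers, each with its own learned analysis convolution."""

    def __init__(self, channels: int = 32, iterations: int = 10):
        super().__init__()
        # One learned analysis operator L_k per unrolled iteration.
        self.L = nn.ModuleList(
            nn.Conv2d(1, channels, 3, padding=1, bias=False)
            for _ in range(iterations)
        )
        self.gamma = nn.Parameter(torch.full((iterations,), 0.5))  # step sizes
        self.lam = nn.Parameter(torch.tensor(0.1))                 # reg. weight

    @staticmethod
    def adjoint(conv: nn.Conv2d, u: torch.Tensor) -> torch.Tensor:
        # Adjoint of a bias-free convolution = transposed convolution
        # with the same kernel.
        return F.conv_transpose2d(u, conv.weight, padding=1)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        u = torch.zeros_like(self.L[0](y))  # dual variable
        for k, Lk in enumerate(self.L):
            x = y - self.adjoint(Lk, u)     # primal estimate from the dual
            u = u + self.gamma[k] * Lk(x)   # dual gradient step
            lam = self.lam.abs()
            u = torch.max(torch.min(u, lam), -lam)  # prox: project onto l-inf ball
        return y - self.adjoint(self.L[-1], u)
```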
{"title":"Unfolded Proximal Neural Networks for Robust Image Gaussian Denoising","authors":"Hoang Trieu Vy Le;Audrey Repetti;Nelly Pustelnik","doi":"10.1109/TIP.2024.3437219","DOIUrl":"10.1109/TIP.2024.3437219","url":null,"abstract":"A common approach to solve inverse imaging problems relies on finding a maximum a posteriori (MAP) estimate of the original unknown image, by solving a minimization problem. In this context, iterative proximal algorithms are widely used, enabling to handle non-smooth functions and linear operators. Recently, these algorithms have been paired with deep learning strategies, to further improve the estimate quality. In particular, proximal neural networks (PNNs) have been introduced, obtained by unrolling a proximal algorithm as for finding a MAP estimate, but over a fixed number of iterations, with learned linear operators and parameters. As PNNs are based on optimization theory, they are very flexible, and can be adapted to any image restoration task, as soon as a proximal algorithm can solve it. They further have much lighter architectures than traditional networks. In this article we propose a unified framework to build PNNs for the Gaussian denoising task, based on both the dual-FB and the primal-dual Chambolle-Pock algorithms. We further show that accelerated inertial versions of these algorithms enable skip connections in the associated NN layers. We propose different learning strategies for our PNN framework, and investigate their robustness (Lipschitz property) and denoising efficiency. Finally, we assess the robustness of our PNNs when plugged in a forward-backward algorithm for an image deblurring problem.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141903949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Embedding Interclass Relation-Aware Adaptive Network for Cross-Scene Classification of Multisource Remote Sensing Data
Pub Date: 2024-08-06 DOI: 10.1109/TIP.2024.3422881
Teng Yang;Song Xiao;Jiahui Qu;Wenqian Dong;Qian Du;Yunsong Li
Unsupervised domain adaptation (UDA) based cross-scene remote sensing image classification has recently become an appealing research topic, since it offers a valid solution to unsupervised scene classification by exploiting well-labeled data from another scene. Despite its good performance in reducing domain shift, UDA in multisource data scenarios faces several critical challenges. First, the heterogeneity inherent in multisource data complicates domain alignment. Second, the feature distribution is represented incompletely when the contribution of global information is neglected. Third, alignment becomes inaccurate when the target-domain conditional distributions are estimated with errors. Because UDA does not guarantee fully consistent distributions across the two domains, networks with simple classifiers still suffer from domain shift and perform poorly. In this paper, we propose a graph embedding interclass relation-aware adaptive network (GeIraA-Net) for unsupervised classification of multisource remote sensing data, which facilitates class-level knowledge transfer between the two domains by leveraging aligned features to perceive inter-class relations. More specifically, we construct a graph-based progressive hierarchical feature extraction network that captures both local and global features of multisource data, consolidating comprehensive domain information within a unified feature space. To deal with imprecise alignment of the data distributions, we design a joint de-scrambling alignment strategy that uses the features obtained by a three-step pseudo-label generation module for finer domain calibration. Moreover, we construct an adaptive inter-class topology-based classifier that further improves classification accuracy by making the classifier domain-adaptive at the category level. Experimental results show that GeIraA-Net has significant advantages over current state-of-the-art cross-scene classification methods.
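To illustrate what an inter-class relation-aware classifier can look like, the sketch below treats class prototypes as graph nodes, builds an adjacency matrix from prototype similarity, and refines the prototypes with one round of graph propagation before classification. The propagation rule and all names are our assumptions rather than GeIraA-Net's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareClassifier(nn.Module):
    """Illustrative sketch: class prototypes form a graph whose edges are
    derived from prototype similarity; one propagation step makes each
    prototype aware of related classes before it is used for scoring."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.02)
        self.proj = nn.Linear(feat_dim, feat_dim, bias=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Inter-class relation graph: softmax-normalized prototype similarity.
        p = F.normalize(self.prototypes, dim=1)
        adj = F.softmax(p @ p.t(), dim=1)  # (C, C) adjacency over classes
        # Propagate: each prototype absorbs information from related classes.
        refined = self.prototypes + adj @ self.proj(self.prototypes)
        # Cosine classifier on the refined, relation-aware prototypes.
        return F.normalize(feats, dim=1) @ F.normalize(refined, dim=1).t()
```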
{"title":"Graph Embedding Interclass Relation-Aware Adaptive Network for Cross-Scene Classification of Multisource Remote Sensing Data","authors":"Teng Yang;Song Xiao;Jiahui Qu;Wenqian Dong;Qian Du;Yunsong Li","doi":"10.1109/TIP.2024.3422881","DOIUrl":"10.1109/TIP.2024.3422881","url":null,"abstract":"The unsupervised domain adaptation (UDA) based cross-scene remote sensing image classification has recently become an appealing research topic, since it is a valid solution to unsupervised scene classification by exploiting well-labeled data from another scene. Despite its good performance in reducing domain shifts, UDA in multisource data scenarios is hindered by several critical challenges. The first one is the heterogeneity inherent in multisource data complicates domain alignment. The second challenge is the incomplete representation of feature distribution caused by the neglect of the contribution from global information. The third challenge is the inaccuracies in alignment due to errors in establishing target domain conditional distributions. Since UDA does not guarantee the complete consistency of the distribution of the two domains, networks using simple classifiers are still affected by domain shifts, resulting in poor performance. In this paper, we propose a graph embedding interclass relation-aware adaptive network (GeIraA-Net) for unsupervised classification of multi-source remote sensing data, which facilitates knowledge transfer at the class level for two domains by leveraging aligned features to perceive inter-class relation. More specifically, a graph-based progressive hierarchical feature extraction network is constructed, capable of capturing both local and global features of multisource data, thereby consolidating comprehensive domain information within a unified feature space. To deal with the imprecise alignment of data distribution, a joint de-scrambling alignment strategy is designed to utilize the features obtained by a three-step pseudo-label generation module for more delicate domain calibration. Moreover, an adaptive inter-class topology based classifier is constructed to further improve the classification accuracy by making the classifier domain adaptive at the category level. The experimental results show that GeIraA-Net has significant advantages over the current state-of-the-art cross-scene classification methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141899233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-view 3D object detection (MV3D) has made tremendous progress by leveraging multi-perspective features from surrounding cameras. Despite promising prospects in various applications, accurately detecting objects in 3D space from camera views remains extremely difficult due to the ill-posed nature of monocular depth estimation. Recently, Graph-DETR3D introduced a novel graph-based 3D-2D query paradigm for aggregating multi-view images in 3D object detection and achieved competitive performance. Although it enriches query representations with 2D image features through a learnable 3D graph, its depth and velocity estimation remain limited by its single-frame input setting. To solve this problem, we introduce a unified spatial-temporal graph modeling framework that fully leverages multi-view imagery cues under a multi-frame input setting. Thanks to the flexibility and sparsity of the dynamic graph architecture, we lift the original 3D graph into 4D space with an effective attention mechanism that automatically perceives imagery information at both the spatial and temporal levels. Moreover, since the main latency bottleneck lies in the image backbone, we propose a novel dense-sparse distillation framework for multi-view 3D object detection that reduces the computational budget without sacrificing detection accuracy, making the method more suitable for real-world deployment. To this end, we propose Graph-DETR4D, a faster and stronger multi-view 3D object detection framework built on top of Graph-DETR3D. Extensive experiments on the nuScenes and Waymo benchmarks demonstrate the effectiveness and efficiency of Graph-DETR4D. Notably, our best model achieves 62.0% NDS on the nuScenes test leaderboard. Code is available at https://github.com/zehuichen123/Graph-DETR4D
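As a loose illustration of the 4D graph feature aggregation, the sketch below projects the 3D points attached to each object query into every camera of every frame and bilinearly samples image features there, then fuses the messages (with a simple mean here, whereas the paper describes an attention mechanism). Shapes and conventions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sample_spatiotemporal(query_pts: torch.Tensor, feats: torch.Tensor,
                          proj_mats: torch.Tensor) -> torch.Tensor:
    """Sketch of spatio-temporal graph feature gathering.

    query_pts : (Q, P, 3)        3D points (graph nodes) per object query
    feats     : (T, C, Ch, H, W) feature maps for T frames x C cameras
    proj_mats : (T, C, 3, 4)     camera projection matrices per frame
    returns   : (Q, P, Ch)       fused per-node image features
    """
    T, Cam, Ch, H, W = feats.shape
    homo = torch.cat([query_pts, torch.ones_like(query_pts[..., :1])], dim=-1)
    gathered = []
    for t in range(T):
        for c in range(Cam):
            uvd = homo @ proj_mats[t, c].t()                 # project to image plane
            uv = uvd[..., :2] / uvd[..., 2:].clamp(min=1e-5) # perspective divide
            # Normalize pixel coordinates to [-1, 1] for grid_sample.
            grid = torch.stack([uv[..., 0] / W, uv[..., 1] / H], dim=-1) * 2 - 1
            sampled = F.grid_sample(
                feats[t, c:c + 1].expand(query_pts.size(0), -1, -1, -1),
                grid.unsqueeze(2), align_corners=False)      # (Q, Ch, P, 1)
            gathered.append(sampled.squeeze(-1).permute(0, 2, 1))
    # Fuse messages from all frames/cameras (mean here; attention in the paper).
    return torch.stack(gathered).mean(dim=0)
```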