Pub Date : 2024-12-30 | DOI: 10.1109/TPAMI.2024.3523675
Jiayi Ji;Haowei Wang;Changli Wu;Yiwei Ma;Xiaoshuai Sun;Rongrong Ji
The rising importance of 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is evident. However, a prevailing trend that straightforwardly transfers 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: this arises from aligning 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: these strategies align 3D representations to image and text features individually, hampering the overall optimization for 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss in detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority, and the superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach.
{"title":"JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues","authors":"Jiayi Ji;Haowei Wang;Changli Wu;Yiwei Ma;Xiaoshuai Sun;Rongrong Ji","doi":"10.1109/TPAMI.2024.3523675","DOIUrl":"10.1109/TPAMI.2024.3523675","url":null,"abstract":"The rising importance of 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is evident. However, a prevailing trend, which straightforwardly resorted to transferring 2D alignment strategies to the 3D domain, encounters three distinct challenges: (1) Information Degradation: This arises from the alignment of 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: These strategies align 3D representations to image and text features individually, hampering the overall optimization for 3D models. (3) Underutilization: The fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss in detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), combining language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2475-2492"},"PeriodicalIF":0.0,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-30 | DOI: 10.1109/TPAMI.2024.3524180
Chinthaka Dinesh;Gene Cheung;Saghar Bagheri;Ivan V. Bajić
A basic premise in graph signal processing (GSP) is that a graph encoding pairwise (anti-)correlations of the targeted signal as edge weights is leveraged for graph filtering. Existing fast graph sampling schemes are designed and tested only for positive graphs describing positive correlations. However, there are many real-world datasets exhibiting strong anti-correlations, and thus a suitable model is a signed graph, containing both positive and negative edge weights. In this paper, we propose the first linear-time method for sampling signed graphs, centered on the concept of balanced signed graphs. Specifically, given an empirical covariance data matrix $\bar{\mathbf{C}}$, we first learn a sparse inverse matrix $\mathcal{L}$, interpreted as a graph Laplacian corresponding to a signed graph $\mathcal{G}$. We approximate $\mathcal{G}$ with a balanced signed graph $\mathcal{G}^{b}$ via fast edge weight augmentation in linear time, where the eigenpairs of the Laplacian $\mathcal{L}^{b}$ for $\mathcal{G}^{b}$ are graph frequencies. Next, in two steps, we select a node subset for sampling to minimize the error of the signal interpolated from the samples. We first align all Gershgorin disc left-ends of the Laplacian $\mathcal{L}^{b}$ at the smallest eigenvalue $\lambda_{\min}(\mathcal{L}^{b})$ via the similarity transform $\mathcal{L}^{s} = \mathbf{S}\mathcal{L}^{b}\mathbf{S}^{-1}$, leveraging a recent linear algebra theorem called Gershgorin disc perfect alignment (GDPA). We then perform sampling on $\mathcal{L}^{s}$ using a previous fast Gershgorin disc alignment sampling (GDAS) scheme. Experiments show that our signed graph sampling method outperformed fast sampling schemes designed for positive graphs on various datasets with anti-correlations.
{"title":"Efficient Signed Graph Sampling via Balancing & Gershgorin Disc Perfect Alignment","authors":"Chinthaka Dinesh;Gene Cheung;Saghar Bagheri;Ivan V. Bajić","doi":"10.1109/TPAMI.2024.3524180","DOIUrl":"10.1109/TPAMI.2024.3524180","url":null,"abstract":"A basic premise in graph signal processing (GSP) is that a graph encoding pairwise (anti-)correlations of the targeted signal as edge weights is leveraged for graph filtering. Existing fast graph sampling schemes are designed and tested only for positive graphs describing positive correlations. However, there are many real-world datasets exhibiting strong anti-correlations, and thus a suitable model is a signed graph, containing both positive and negative edge weights. In this paper, we propose the first linear-time method for sampling signed graphs, centered on the concept of balanced signed graphs. Specifically, given an empirical covariance data matrix <inline-formula><tex-math>$bar{{mathbf {C}}}$</tex-math></inline-formula>, we first learn a sparse inverse matrix <inline-formula><tex-math>${mathcal {L}}$</tex-math></inline-formula>, interpreted as a graph Laplacian corresponding to a signed graph <inline-formula><tex-math>${mathcal {G}}$</tex-math></inline-formula>. We approximate <inline-formula><tex-math>${mathcal {G}}$</tex-math></inline-formula> with a balanced signed graph <inline-formula><tex-math>${mathcal {G}}^{b}$</tex-math></inline-formula> via fast edge weight augmentation in linear time, where the eigenpairs of Laplacian <inline-formula><tex-math>${mathcal {L}}^{b}$</tex-math></inline-formula> for <inline-formula><tex-math>${mathcal {G}}^{b}$</tex-math></inline-formula> are graph frequencies. Next, we select a node subset for sampling to minimize the error of the signal interpolated from samples in two steps. We first align all Gershgorin disc left-ends of Laplacian <inline-formula><tex-math>${mathcal {L}}^{b}$</tex-math></inline-formula> at the smallest eigenvalue <inline-formula><tex-math>$lambda _{min }({mathcal {L}}^{b})$</tex-math></inline-formula> via similarity transform <inline-formula><tex-math>${mathcal {L}}^{s} = {mathbf {S}}{mathcal {L}}^{b} {mathbf {S}}^{-1}$</tex-math></inline-formula>, leveraging a recent linear algebra theorem called Gershgorin disc perfect alignment (GDPA). We then perform sampling on <inline-formula><tex-math>${mathcal {L}}^{s}$</tex-math></inline-formula> using a previous fast Gershgorin disc alignment sampling (GDAS) scheme. Experiments show that our signed graph sampling method outperformed fast sampling schemes designed for positive graphs on various datasets with anti-correlations.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2330-2348"},"PeriodicalIF":0.0,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-30 | DOI: 10.1109/TPAMI.2024.3524377
Yifan Feng;Jiangang Huang;Shaoyi Du;Shihui Ying;Jun-Hai Yong;Yipeng Li;Guiguang Ding;Rongrong Ji;Yue Gao
We introduce Hyper-YOLO, a new object detection method that integrates hypergraph computations to capture the complex high-order correlations among visual features. Traditional YOLO models, while powerful, have limitations in their neck designs that restrict the integration of cross-level features and the exploitation of high-order feature interrelationships. To address these challenges, we propose the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework, which transposes visual feature maps into a semantic space and constructs a hypergraph for high-order message propagation. This enables the model to acquire both semantic and structural information, advancing beyond conventional feature-focused learning. Hyper-YOLO incorporates the proposed Mixed Aggregation Network (MANet) in its backbone for enhanced feature extraction and introduces the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net) in its neck. HyperC2Net operates across five scales and breaks free from traditional grid structures, allowing for sophisticated high-order interactions across levels and positions. This synergy of components positions Hyper-YOLO as a state-of-the-art architecture across model scales, as evidenced by its superior performance on the COCO dataset. Specifically, Hyper-YOLO-N significantly outperforms the advanced YOLOv8-N and YOLOv9-T with 12% $\text{AP}^{val}$ and 9% $\text{AP}^{val}$ improvements.
{"title":"Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation","authors":"Yifan Feng;Jiangang Huang;Shaoyi Du;Shihui Ying;Jun-Hai Yong;Yipeng Li;Guiguang Ding;Rongrong Ji;Yue Gao","doi":"10.1109/TPAMI.2024.3524377","DOIUrl":"10.1109/TPAMI.2024.3524377","url":null,"abstract":"We introduce Hyper-YOLO, a new object detection method that integrates hypergraph computations to capture the complex high-order correlations among visual features. Traditional YOLO models, while powerful, have limitations in their neck designs that restrict the integration of cross-level features and the exploitation of high-order feature interrelationships. To address these challenges, we propose the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework, which transposes visual feature maps into a semantic space and constructs a hypergraph for high-order message propagation. This enables the model to acquire both semantic and structural information, advancing beyond conventional feature-focused learning. Hyper-YOLO incorporates the proposed Mixed Aggregation Network (MANet) in its backbone for enhanced feature extraction and introduces the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net) in its neck. HyperC2Net operates across five scales and breaks free from traditional grid structures, allowing for sophisticated high-order interactions across levels and positions. This synergy of components positions Hyper-YOLO as a state-of-the-art architecture in various scale models, as evidenced by its superior performance on the COCO dataset. Specifically, Hyper-YOLO-N significantly outperforms the advanced YOLOv8-N and YOLOv9-T with 12% <inline-formula><tex-math>$text{AP}^{val}$</tex-math></inline-formula> and 9% <inline-formula><tex-math>$text{AP}^{val}$</tex-math></inline-formula> improvements.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2388-2401"},"PeriodicalIF":0.0,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-30 | DOI: 10.1109/TPAMI.2024.3524440
Yue Dai;Yifan Feng;Nan Ma;Xibin Zhao;Yue Gao
Cross-modal 3D shape retrieval is a crucial and widely applied task in the field of 3D vision. Its goal is to construct retrieval representations capable of measuring the similarity between instances of different 3D modalities. However, existing methods face challenges due to the performance bottlenecks of single-modal representation extractors and the modality gap across 3D modalities. To tackle these issues, we propose a Heterogeneous Dynamic Graph Representation (HDGR) network, which incorporates context-dependent dynamic relations within a heterogeneous framework. By capturing correlations among diverse 3D objects, HDGR overcomes the limitations of ambiguous representations obtained solely from instances. Within the context of varying mini-batches, dynamic graphs are constructed to capture proximal intra-modal relations, and dynamic bipartite graphs represent implicit cross-modal relations, effectively addressing the two challenges above. Subsequently, message passing and aggregation are performed using Dynamic Graph Convolution (DGConv) and Dynamic Bipartite Graph Convolution (DBConv), enhancing features through heterogeneous dynamic relation learning. Finally, intra-modal, cross-modal, and self-transformed features are redistributed and integrated into a heterogeneous dynamic representation for cross-modal 3D shape retrieval. HDGR establishes a stable, context-enhanced, structure-aware 3D shape representation by capturing heterogeneous inter-object relationships and adapting to varying contextual dynamics. Extensive experiments conducted on the ModelNet10, ModelNet40, and real-world ABO datasets demonstrate the state-of-the-art performance of HDGR in cross-modal and intra-modal retrieval tasks. Moreover, under the supervision of robust loss functions, HDGR achieves remarkable cross-modal retrieval performance under label noise on the 3D MNIST dataset. The comprehensive experimental results highlight the effectiveness and efficiency of HDGR on cross-modal 3D shape retrieval.
{"title":"Cross-Modal 3D Shape Retrieval via Heterogeneous Dynamic Graph Representation","authors":"Yue Dai;Yifan Feng;Nan Ma;Xibin Zhao;Yue Gao","doi":"10.1109/TPAMI.2024.3524440","DOIUrl":"10.1109/TPAMI.2024.3524440","url":null,"abstract":"Cross-modal 3D shape retrieval is a crucial and widely applied task in the field of 3D vision. Its goal is to construct retrieval representations capable of measuring the similarity between instances of different 3D modalities. However, existing methods face challenges due to the performance bottlenecks of single-modal representation extractors and the modality gap across 3D modalities. To tackle these issues, we propose a Heterogeneous Dynamic Graph Representation (HDGR) network, which incorporates context-dependent dynamic relations within a heterogeneous framework. By capturing correlations among diverse 3D objects, HDGR overcomes the limitations of ambiguous representations obtained solely from instances. Within the context of varying mini-batches, dynamic graphs are constructed to capture proximal intra-modal relations, and dynamic bipartite graphs represent implicit cross-modal relations, effectively addressing the two challenges above. Subsequently, message passing and aggregation are performed using Dynamic Graph Convolution (DGConv) and Dynamic Bipartite Graph Convolution (DBConv), enhancing features through heterogeneous dynamic relation learning. Finally, intra-modal, cross-modal, and self-transformed features are redistributed and integrated into a heterogeneous dynamic representation for cross-modal 3D shape retrieval. HDGR establishes a stable, context-enhanced, structure-aware 3D shape representation by capturing heterogeneous inter-object relationships and adapting to varying contextual dynamics. Extensive experiments conducted on the ModelNet10, ModelNet40, and real-world ABO datasets demonstrate the state-of-the-art performance of HDGR in cross-modal and intra-modal retrieval tasks. Moreover, under the supervision of robust loss functions, HDGR achieves remarkable cross-modal retrieval against label noise on the 3D MNIST dataset. The comprehensive experimental results highlight the effectiveness and efficiency of HDGR on cross-modal 3D shape retrieval.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2370-2387"},"PeriodicalIF":0.0,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-27 | DOI: 10.1109/TPAMI.2024.3522305
Fengxiang Bie;Yibo Yang;Zhongzhu Zhou;Adam Ghanem;Minjia Zhang;Zhewei Yao;Xiaoxia Wu;Connor Holmes;Pareesa Golnari;David A. Clifton;Yuxiong He;Dacheng Tao;Shuaiwen Leon Song
Text-to-image generation (TTI) refers to models that process text input and generate high-fidelity images based on text descriptions. Text-to-image generation using neural networks can be traced back to the emergence of Generative Adversarial Networks (GANs), followed by autoregressive Transformers. Diffusion models are one prominent type of generative model that produces images through the systematic introduction and learned removal of noise over repeated steps. Owing to their impressive results on image synthesis, diffusion models have been cemented as the major image decoder used by text-to-image models, bringing text-to-image generation to the forefront of machine-learning (ML) research. In the era of large models, scaling up model size and integrating with large language models have further improved the performance of TTI models, yielding generations nearly indistinguishable from real-world images and revolutionizing the way we retrieve images. Our exploratory study suggests that there are further ways of scaling text-to-image models through the combination of innovative model architectures and prediction enhancement techniques. We have divided this survey into five main sections, in which we detail the frameworks of the major literature in order to delve into the different types of text-to-image generation methods. Following this, we provide a detailed comparison and critique of these methods and offer possible pathways of improvement for future work. Looking ahead, we argue that TTI development could yield impressive productivity improvements for content creation, particularly in the AIGC era, and could be extended to more complex tasks such as video generation and 3D generation.
{"title":"RenAIssance: A Survey Into AI Text-to-Image Generation in the Era of Large Model","authors":"Fengxiang Bie;Yibo Yang;Zhongzhu Zhou;Adam Ghanem;Minjia Zhang;Zhewei Yao;Xiaoxia Wu;Connor Holmes;Pareesa Golnari;David A. Clifton;Yuxiong He;Dacheng Tao;Shuaiwen Leon Song","doi":"10.1109/TPAMI.2024.3522305","DOIUrl":"10.1109/TPAMI.2024.3522305","url":null,"abstract":"Text-to-image generation (TTI) refers to the usage of models that could process text input and generate high fidelity images based on text descriptions. Text-to-image generation using neural networks could be traced back to the emergence of Generative Adversial Network (GAN), followed by the autoregressive Transformer. Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noises with repeating steps. As an effect of the impressive results of diffusion models on image synthesis, it has been cemented as the major image decoder used by text-to-image models and brought text-to-image generation to the forefront of machine-learning (ML) research. In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models, resulting the generation result nearly indistinguishable from real-world images, revolutionizing the way we retrieval images. Our explorative study has incentivised us to think that there are further ways of scaling text-to-image models with the combination of innovative model architectures and prediction enhancement techniques. We have divided the work of this survey into five main sections wherein we detail the frameworks of major literature in order to delve into the different types of text-to-image generation methods. Following this we provide a detailed comparison and critique of these methods and offer possible pathways of improvement for future work. In the future work, we argue that TTI development could yield impressive productivity improvements for creation, particularly in the context of the AIGC era, and could be extended to more complex tasks such as video generation and 3D generation.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2212-2231"},"PeriodicalIF":0.0,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-26 | DOI: 10.1109/TPAMI.2024.3521478
Chuanxing Geng;Aiyang Han;Songcan Chen
Consistency and complementarity are two key ingredients for boosting multi-view clustering (MVC). Recently, with the introduction of contrastive learning, the consistency learning of views has been further enhanced in MVC, leading to promising performance. By contrast, complementarity has not received sufficient attention, except in the feature facet, where a Hilbert-Schmidt Independence Criterion term or an independent encoder-decoder network is usually adopted to capture view-specific information. This motivates us to reconsider the complementarity learning of views comprehensively from multiple facets, including the feature, view-label, and contrast facets, while maintaining view consistency. We empirically find that all these facets contribute to complementarity learning, especially the view-label facet, which is usually neglected by existing methods. Based on this, we develop a simple yet effective Multifacet Complementarity learning framework for Multi-View Clustering (MCMVC), which fuses multifacet complementarity information and, in particular, explicitly embeds view-label information. To the best of our knowledge, this is the first work to use view labels explicitly to guide the complementarity learning of views. Compared with SOTA baselines, MCMVC achieves remarkable improvements, e.g., by average margins of over 5.00% and 7.00% in the complete and incomplete MVC settings, respectively, on Caltech101-20 in terms of three evaluation metrics.
{"title":"Explicit View-Labels Matter: A Multifacet Complementarity Study of Multi-View Clustering","authors":"Chuanxing Geng;Aiyang Han;Songcan Chen","doi":"10.1109/TPAMI.2024.3521478","DOIUrl":"10.1109/TPAMI.2024.3521478","url":null,"abstract":"Consistency and complementarity are two key ingredients for boosting multi-view clustering (MVC). Recently with the introduction of popular contrastive learning, the consistency learning of views has been further enhanced in MVC, leading to promising performance. However, by contrast, the complementarity has not received sufficient attention except just in the feature facet, where the Hilbert Schmidt Independence Criterion term or the independent encoder-decoder network is usually adopted to capture view-specific information. This motivates us to reconsider the complementarity learning of views comprehensively from multiple facets including the feature-, view-label- and contrast- facets, while maintaining the view consistency. We empirically find that all the facets contribute to the complementarity learning, especially the view-label facet, which is usually neglected by existing methods. Based on this, a simple yet effective <underline>M</u>ultifacet <underline>C</u>omplementarity learning framework for <underline>M</u>ulti-<underline>V</u>iew <underline>C</u>lustering (MCMVC) is naturally developed, which fuses multifacet complementarity information, especially explicitly embedding the view-label information. To our best knowledge, it is the first time to use view-labels explicitly to guide the complementarity learning of views. Compared with the SOTA baselines, MCMVC achieves remarkable improvements, e.g., by average margins over 5.00% and 7.00% respectively in complete and incomplete MVC settings on Caltech101-20 in terms of three evaluation metrics.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2520-2532"},"PeriodicalIF":0.0,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-26 | DOI: 10.1109/TPAMI.2024.3522994
Tianxin Xie;Hu Han;Shiguang Shan;Xilin Chen
Facial recognition (FR) technology offers convenience in our daily lives, but it also raises serious privacy issues due to unauthorized FR applications. To protect facial privacy, existing methods have proposed adversarial face examples that can fool FR systems. However, most of these methods work only in the digital domain and do not consider natural physical protections. In this paper, we present NatMask, a 3D-based method for creating natural and realistic adversarial face masks that can preserve facial identity in the physical world. Our method utilizes 3D face reconstruction and differentiable rendering to generate 2D face images with natural-looking facial masks. Moreover, we propose an identity-aware style injection (IASI) method to improve the naturalness and transferability of the mask texture. We evaluate our method on two face datasets to verify its effectiveness in protecting face identity against four state-of-the-art (SOTA) FR models and three commercial FR APIs in both digital and physical domains under black-box impersonation and dodging strategies. Experiments show that our method can generate adversarial masks with superior naturalness and physical realizability to safeguard face identity, outperforming SOTA methods by a large margin.
{"title":"Natural Adversarial Mask for Face Identity Protection in Physical World","authors":"Tianxin Xie;Hu Han;Shiguang Shan;Xilin Chen","doi":"10.1109/TPAMI.2024.3522994","DOIUrl":"10.1109/TPAMI.2024.3522994","url":null,"abstract":"Facial recognition (FR) technology offers convenience in our daily lives, but it also raises serious privacy issues due to unauthorized FR applications. To protect facial privacy, existing methods have proposed adversarial face examples that can fool FR systems. However, most of these methods work only in the digital domain and do not consider natural physical protections. In this paper, we present NatMask, a 3D-based method for creating natural and realistic adversarial face masks that can preserve facial identity in the physical world. Our method utilizes 3D face reconstruction and differentiable rendering to generate 2D face images with natural-looking facial masks. Moreover, we propose an identity-aware style injection (IASI) method to improve the naturalness and transferability of the mask texture. We evaluate our method on two face datasets to verify its effectiveness in protecting face identity against four state-of-the-art (SOTA) FR models and three commercial FR APIs in both digital and physical domains under black-box impersonation and dodging strategies. Experiments show that our method can generate adversarial masks with superior naturalness and physical realizability to safeguard face identity, outperforming SOTA methods by a large margin.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2089-2106"},"PeriodicalIF":0.0,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-26 | DOI: 10.1109/TPAMI.2024.3522258
Shipeng Wang;Xiaorong Li;Jian Sun;Zongben Xu
In the context of incremental learning, a network is sequentially trained on a stream of tasks, where data from previous tasks are assumed to be inaccessible. The major challenge is how to overcome the stability-plasticity dilemma, i.e., learning knowledge from new tasks without forgetting the knowledge of previous tasks. To this end, we propose two mathematical conditions for guaranteeing network stability and plasticity with theoretical analysis. The conditions demonstrate that we can restrict the parameter update to the null space of the uncentered feature covariance at each linear layer to overcome the stability-plasticity dilemma, which can be realized by layerwise projection of the gradient into the null space. Inspired by this, we develop two algorithms, dubbed Adam-NSCL and Adam-SFCL respectively, for incremental learning. Adam-NSCL and Adam-SFCL provide different ways to compute the projection matrix. The projection matrix in Adam-NSCL is constructed from the singular vectors associated with the smallest singular values of the uncentered feature covariance matrix, while the projection matrix in Adam-SFCL is constructed from all singular vectors with adaptive scaling factors. Additionally, we explore adopting self-supervised techniques, including self-supervised label augmentation and a newly proposed contrastive loss, to improve the performance of incremental learning. These self-supervised techniques are orthogonal to Adam-NSCL and Adam-SFCL and can be incorporated with them seamlessly, leading to Adam-NSCL-SSL and Adam-SFCL-SSL respectively. The proposed algorithms are applied to task-incremental and class-incremental learning on various benchmark datasets with multiple backbones, and the results show that they outperform the compared incremental learning methods.
{"title":"Training Networks in Null Space of Feature Covariance With Self-Supervision for Incremental Learning","authors":"Shipeng Wang;Xiaorong Li;Jian Sun;Zongben Xu","doi":"10.1109/TPAMI.2024.3522258","DOIUrl":"10.1109/TPAMI.2024.3522258","url":null,"abstract":"In the context of incremental learning, a network is sequentially trained on a stream of tasks, where data from previous tasks are particularly assumed to be inaccessible. The major challenge is how to overcome the stability-plasticity dilemma, i.e., learning knowledge from new tasks without forgetting the knowledge of previous tasks. To this end, we propose two mathematical conditions for guaranteeing network stability and plasticity with theoretical analysis. The conditions demonstrate that we can restrict the parameter update in the null space of uncentered feature covariance at each linear layer to overcome the stability-plasticity dilemma, which can be realized by layerwise projecting gradient into the null space. Inspired by it, we develop two algorithms, dubbed Adam-NSCL and Adam-SFCL respectively, for incremental learning. Adam-NSCL and Adam-SFCL provide different ways to compute the projection matrix. The projection matrix in Adam-NSCL is constructed by singular vectors associated with the smallest singular values of the uncentered feature covariance matrix, while the projection matrix in Adam-SFCL is constructed by all singular vectors associated with adaptive scaling factors. Additionally, we explore adopting self-supervised techniques, including self-supervised label augmentation and a newly proposed contrastive loss, to improve the performance of incremental learning. These self-supervised techniques are orthogonal to Adam-NSCL and Adam-SFCL and can be incorporated with them seamlessly, leading to Adam-NSCL-SSL and Adam-SFCL-SSL respectively. The proposed algorithms are applied to task-incremental and class-incremental learning on various benchmark datasets with multiple backbones, and the results show that they outperform the compared incremental learning methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2563-2580"},"PeriodicalIF":0.0,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-25 | DOI: 10.1109/TPAMI.2024.3522298
Daojun Liang;Haixia Zhang;Dongfeng Yuan;Minggao Zhang
The number of categories of instances in the real world is normally huge, and each instance may contain multiple labels. To distinguish these massive labels utilizing machine learning, eXtreme Label Classification (XLC) has been established. However, as the number of categories increases, the number of parameters and nonlinear operations in the classifier also rises. This results in a Classifier Computational Overload Problem (CCOP). To address this, we propose a Multi-Head Encoding (MHE) mechanism, which replaces the vanilla classifier with a multi-head classifier. During training, MHE decomposes extreme labels into the product of multiple short local labels, with each head trained on these local labels. During testing, the predicted labels can be directly calculated from the local predictions of each head. This reduces the computational load geometrically. Then, according to the characteristics of different XLC tasks, e.g., single-label, multi-label, and model pretraining tasks, three MHE-based implementations, i.e., Multi-Head Product, Multi-Head Cascade, and Multi-Head Sampling, are proposed to cope with CCOP more effectively. Moreover, we theoretically demonstrate that MHE can achieve performance approximately equivalent to that of the vanilla classifier by generalizing the low-rank approximation problem from the Frobenius norm to cross-entropy. Experimental results show that the proposed methods achieve state-of-the-art performance while significantly streamlining the training and inference processes of XLC tasks.
{"title":"Multi-Head Encoding for Extreme Label Classification","authors":"Daojun Liang;Haixia Zhang;Dongfeng Yuan;Minggao Zhang","doi":"10.1109/TPAMI.2024.3522298","DOIUrl":"10.1109/TPAMI.2024.3522298","url":null,"abstract":"The number of categories of instances in the real world is normally huge, and each instance may contain multiple labels. To distinguish these massive labels utilizing machine learning, eXtreme Label Classification (XLC) has been established. However, as the number of categories increases, the number of parameters and nonlinear operations in the classifier also rises. This results in a Classifier Computational Overload Problem (CCOP). To address this, we propose a Multi-Head Encoding (MHE) mechanism, which replaces the vanilla classifier with a multi-head classifier. During the training process, MHE decomposes extreme labels into the product of multiple short local labels, with each head trained on these local labels. During testing, the predicted labels can be directly calculated from the local predictions of each head. This reduces the computational load geometrically. Then, according to the characteristics of different XLC tasks, e.g., single-label, multi-label, and model pretraining tasks, three MHE-based implementations, i.e., Multi-Head Product, Multi-Head Cascade, and Multi-Head Sampling, are proposed to more effectively cope with CCOP. Moreover, we theoretically demonstrate that MHE can achieve performance approximately equivalent to that of the vanilla classifier by generalizing the low-rank approximation problem from Frobenius-norm to Cross-Entropy. Experimental results show that the proposed methods achieve state-of-the-art performance while significantly streamlining the training and inference processes of XLC tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2199-2211"},"PeriodicalIF":0.0,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-24 | DOI: 10.1109/TPAMI.2024.3522124
Peng Jin;Hao Li;Li Yuan;Shuicheng Yan;Jie Chen
Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video and text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video question answering, and video captioning benchmarks, where our method achieves superior performance, validate its effectiveness and generalization.
{"title":"Hierarchical Banzhaf Interaction for General Video-Language Representation Learning","authors":"Peng Jin;Hao Li;Li Yuan;Shuicheng Yan;Jie Chen","doi":"10.1109/TPAMI.2024.3522124","DOIUrl":"10.1109/TPAMI.2024.3522124","url":null,"abstract":"Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video-text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video-question answering, and video captioning benchmarks, with superior performance, validate the effectiveness and generalization of our method.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2125-2139"},"PeriodicalIF":0.0,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10815073","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142884231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}