Open-vocabulary Video Instance Segmentation addresses the challenging task of detecting, segmenting, and tracking objects in videos, including categories not encountered during training. However, existing approaches often overlook rich temporal cues from preceding frames, limiting their ability to leverage causal context for robust open-world generalization. To bridge this gap, we propose CPOVIS, a novel framework that introduces causal prompts (dynamically propagated visual and taxonomy prompts from historical frames) to enhance temporal reasoning and semantic consistency. Built upon a Mask2Former architecture with a CLIP backbone, CPOVIS integrates three core innovations: (1) PromptCLIP, which aligns cross-modal embeddings while preserving open-vocabulary capabilities; (2) a Visual Prompt Injector that propagates object-level features to maintain spatial-temporal coherence; and (3) a Taxonomy Prompt Infuser that leverages hierarchical semantic relationships to stabilize unseen category recognition. Furthermore, we introduce a contrastive learning strategy to disentangle object representations across frames and adapt the Segment Anything Model (SAM2) to boost segmentation and tracking capacity in open-vocabulary video scenarios. Extensive experiments on seven challenging open- and closed-vocabulary video segmentation benchmarks demonstrate CPOVIS's state-of-the-art performance, outperforming existing methods by significant margins. Our findings highlight the critical role of causal prompt propagation in advancing video understanding in open-world scenarios.
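The prompt-propagation idea can be illustrated with a toy sketch. This is purely illustrative: `propagate_prompts`, the cosine matching, and the EMA update are assumptions for exposition, not CPOVIS's actual modules. The sketch shows how object prompts carried over from earlier frames can be matched to current-frame features and updated causally.

```python
import numpy as np

def propagate_prompts(memory, queries, momentum=0.9):
    """Toy causal-prompt propagation: match each stored object prompt to its
    most similar current-frame query (cosine similarity) and update it with an
    exponential moving average, so historical features condition the next frame.

    memory:  (M, D) object prompts carried over from previous frames
    queries: (N, D) object features from the current frame
    Returns the updated (M, D) prompt memory.
    """
    # Normalise rows for cosine similarity.
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sim = m @ q.T                      # (M, N) similarity matrix
    best = sim.argmax(axis=1)          # best-matching query per prompt
    # EMA update keeps prompts temporally smooth (the "causal" context).
    return momentum * memory + (1.0 - momentum) * queries[best]
```

Under this sketch, a prompt drifts slowly toward whichever current-frame object it matches, which is one simple way temporal consistency could be maintained.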
Title: Causal Prompts for Open-vocabulary Video Instance Segmentation
Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
DOI: 10.1109/tpami.2026.3669976
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Pub Date: 2026-03-03
Pub Date: 2026-03-03 | DOI: 10.1109/tpami.2026.3669995
Hai Jiang, Haipeng Li, Songchen Han, Bing Zeng, Shuaicheng Liu
In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data for supervised small-baseline and large-baseline homography learning and yield a state-of-the-art homography estimation network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content refinement diffusion model. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method outperforms existing competitors and previous supervised methods can also be improved based on the generated dataset. The code and dataset are available at https://github.com/JianghaiSCU/RealSH.
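The generation phase's key trick, warping a frame with a *sampled* homography so the new pair's ground-truth motion is known exactly, can be sketched as follows. This is a toy illustration in the common 4-point offset parameterisation; the function names are hypothetical, and the dominant plane masks and diffusion-based content refinement of the actual method are omitted.

```python
import numpy as np

def apply_homography(H, pts):
    """Map 2-D points through a 3x3 homography H in homogeneous coordinates."""
    ones = np.ones((pts.shape[0], 1))
    ph = np.hstack([pts, ones]) @ H.T
    return ph[:, :2] / ph[:, 2:3]          # perspective divide

def make_labeled_pair(img_corners, H_sampled):
    """Toy version of the generation phase: warping one frame of an unlabeled
    pair with a sampled homography yields a training pair whose ground truth
    is exactly H_sampled, represented here as the four-corner offsets."""
    warped = apply_homography(H_sampled, img_corners)
    return warped - img_corners            # 4-point offset label
```

For example, a pure-translation homography produces identical offsets at all four corners, and the identity homography produces a zero label, which is a quick sanity check on the labeling.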
Title: Supervised Small-baseline and Large-baseline Homography Learning with Diffusion-based Data Generation
Pub Date: 2026-03-03 | DOI: 10.1109/tpami.2026.3669168
Caixing Wang
Random features (RFs) provide an efficient approximation to kernel methods, and allow for scalable learning on large datasets by reducing computational complexity while maintaining strong theoretical guarantees. However, real-world data can often be contaminated by outliers or heavy-tailed noise, which significantly degrades the performance of standard RF algorithms. To address this issue, we propose a robust and adaptive regularized least squares method with random features (RRLS-RF) that incorporates response truncation. The truncation level adaptively balances robustness and bias based on the sample size and moment conditions. We establish the generalization properties of RRLS-RF by assuming only a bounded $(1+\delta)$-th moment for any $\delta > 0$. Specifically, our analysis shows that RRLS-RF achieves learning rates of $\mathcal{O}(|D|^{-\frac{\delta}{2\delta+2}})$ with only $\mathcal{O}(|D|^{\frac{\delta}{2\delta+2}}\log |D|)$ random features, where $|D|$ denotes the training sample size. These results converge to the optimal learning rates of $\mathcal{O}(|D|^{-\frac{1}{2}})$ as $\delta \rightarrow \infty$, covering the traditional boundedness or sub-Gaussian assumptions in the regularized least squares method with random features (RLS-RF). Furthermore, we refine our analysis and show that RRLS-RF can achieve even faster learning rates under source and capacity conditions, as well as a smaller number of RFs with data-dependent sampling strategies. The derived sharp learning rates can also cover the mis-specified settings where the true function may not precisely align with the assumed kernel space. We further establish the first minimax lower bound under the weak moment condition, which shows that the RRLS-RF estimator is optimal over a wide range of source conditions. 
Our numerical experiments and real data analysis verify the theoretical results and demonstrate the superior robustness of RRLS-RF against outliers and heavy-tailed noise compared to standard methods.
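A minimal sketch of the core RRLS-RF recipe (random Fourier features for an RBF kernel, plus a ridge solve on truncated responses) is given below. The feature map and solver are standard; the exact truncation constant is a guess that loosely follows the $|D|^{\frac{1}{2\delta+2}}$ scaling suggested by the moment condition, and the function names are hypothetical.

```python
import numpy as np

def rrls_rf_fit(X, y, n_features=100, lam=1e-2, delta=1.0, seed=0):
    """Sketch of robust regularised least squares with random Fourier features
    and response truncation. tau grows with the sample size; its constant is an
    assumption, not the paper's tuned choice."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(d, n_features))                # RBF spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    Z = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)   # random feature map
    tau = n ** (1.0 / (2.0 * delta + 2.0))              # adaptive truncation level
    y_t = np.clip(y, -tau, tau)                         # truncate heavy-tailed responses
    # Ridge solve in feature space: (Z^T Z + n*lam*I) beta = Z^T y_t
    A = Z.T @ Z + n * lam * np.eye(n_features)
    beta = np.linalg.solve(A, Z.T @ y_t)
    return W, b, beta

def rrls_rf_predict(X, W, b, beta):
    Z = np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)
    return Z @ beta
```

The truncation step is the robustness mechanism: a single gross outlier in `y` is clipped to `tau` before it can dominate the least-squares fit.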
Title: Generalization Properties of Robust Learning With Random Features
The VQA explanation task aims to explain the decision-making process of VQA models in a way that is easily understandable to humans. Existing methods mostly use visual localization or natural language explanation approaches to generate corresponding rationales. Although significant progress has been made, these frameworks are bottlenecked by the following challenges: 1) the uni-modal paradigm inevitably leads to semantic ambiguity of explanations; 2) the reasoning process is not faithfully reflected and suffers from logical inconsistency; and 3) human-annotated explanations are expensive and time-consuming to collect. In this paper, we introduce a new Semi-supervised VQA Multi-modal Explanation (SME) method via self-critical learning, which addresses the above challenges by leveraging both visual and textual explanations to comprehensively reveal the inference process of the model. Meanwhile, to improve the logical consistency between answers and rationales, we design a novel self-critical strategy that evaluates candidate explanations based on answer reward scores. More importantly, our method can benefit from a tremendous number of samples without human-annotated explanations via semi-supervised learning. Extensive automatic measures and human evaluations all show the effectiveness of our method. Finally, the framework achieves new state-of-the-art performance on three VQA explanation datasets. The code for this work is publicly available at https://github.com/Fake10086/MM-Explanations.
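The self-critical idea, weighting each candidate explanation's log-likelihood by its answer-reward advantage over a baseline, can be sketched as follows. This is illustrative only: the mean-reward baseline and the reward definition here are assumptions, not SME's exact formulation.

```python
import numpy as np

def self_critical_loss(logprobs, rewards):
    """Toy self-critical objective over a batch of candidate explanations.

    logprobs: (K,) log-probabilities of K sampled candidate explanations
    rewards:  (K,) answer-reward scores (e.g., likelihood of the correct answer
              given each explanation) -- the reward definition is an assumption.

    Each candidate is weighted by its advantage over the batch-mean baseline,
    so only explanations that raise the answer reward are reinforced.
    """
    advantages = rewards - rewards.mean()
    return float(-(advantages * logprobs).sum())
```

With a baseline subtracted, an above-average explanation gets a positive weight (its likelihood is pushed up) and a below-average one gets a negative weight, which is the standard REINFORCE-with-baseline pattern the abstract's strategy resembles.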
Title: Semi-Supervised VQA Multi-Modal Explanation via Self-Critical Learning
Authors: Wei Suo, Ji Ma, Mengyang Sun, Hanwang Zhang, Peng Wang, Yanning Zhang, Qi Wu
DOI: 10.1109/tpami.2026.3669188
Pub Date: 2026-03-02
Pub Date: 2026-03-02 | DOI: 10.1109/tpami.2026.3669431
Tie Liu, Mai Xu, Shengxi Li, Jialu Zhang, Lai Jiang
Recent years have witnessed an explosive increase in face content, which drives a distinct shift from static images to dynamic video formats. This shift of formats inherently alters the characteristics of face videos, whereby pixel-wise artifacts are intertwined with motion-related impairments. Addressing these emerging distortions, which now typically appear in tandem in practice, is challenging and non-trivial, due to the distinct characteristics of spatial-temporal frequencies in videos. In this paper, we propose a novel Unified recurrent network for joint Face video quality Enhancement and Stabilization (UniFES), as the first successful attempt at both quality enhancement and motion stabilization. Correspondingly, our UniFES method effectively aggregates the mutual information in the pixel and motion domains. For quality enhancement, UniFES decomposes the shaky temporal alignment problem into progressive feature alignment with explicit physical information, which includes the global dynamics from the motion domain, i.e., from the stabilization task. For video stabilization, we integrate the mixed dynamics from the enhancement task (i.e., from the pixel domain) to take into account both pixel-wise and motion-related characteristics, ensuring robust trajectory estimation and motion stabilization. Subsequently, we refine the warping masks to achieve high-quality full-frame rendering. We further establish a synthetic dataset for training and evaluation of this emerging task. Comprehensive experiments illustrate the superior performance of our UniFES method over 32 competing baselines on both newly established synthetic and real-world datasets.
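As a hedged illustration of what the stabilization branch ultimately needs, a smoothed camera trajectory to warp each frame toward, here is a generic moving-average trajectory smoother. It is a textbook baseline, not UniFES's recurrent trajectory estimator.

```python
import numpy as np

def smooth_trajectory(traj, window=5):
    """Smooth a per-frame 1-D camera trajectory (e.g., cumulative x-translation)
    with a centred moving average. A stabilizer renders each frame with the
    warp that maps its original pose to the smoothed pose."""
    pad = window // 2
    padded = np.pad(traj, (pad, pad), mode="edge")   # replicate endpoints
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid") # same length as input
```

Edge replication keeps the output the same length as the input, and an already-smooth (constant) trajectory passes through unchanged.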
Title: UniFES: A Unified Recurrent Network for Quality Enhancement and Stabilization in Face Videos
Pub Date: 2026-03-02 | DOI: 10.1109/tpami.2026.3669252
Lu Yu, Haiyang Zhang, Changsheng Xu
Due to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: a Local Attention Refinement Module and a Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. The Local Attention Refinement Module aligns the text-guided attention obtained from the target model via adversarial examples with the text-guided attention acquired from the original model via clean examples. This alignment enhances the model's robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground.
The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements, respectively, in zero-shot robust accuracy over current state-of-the-art techniques across 16 datasets.
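The two attention mechanisms can be sketched in a few lines. Both functions are illustrative stand-ins under stated assumptions: an L1 penalty for the attention-alignment idea, and a simple additive fusion for the complementary (class + reversed non-class) attention; the papers' actual losses and normalizations may differ.

```python
import numpy as np

def attention_refinement_loss(attn_adv, attn_clean):
    """Toy Local-Attention-Refinement penalty: measure the shift adversarial
    perturbations induce in text-guided attention as the mean L1 distance
    between the target model's attention on adversarial inputs and the
    original model's attention on clean inputs."""
    return float(np.abs(attn_adv - attn_clean).mean())

def complementary_foreground(attn_class, attn_nonclass):
    """Comp-TGA-style fusion (sketch): combine class-prompt attention with the
    *reversed* non-class-prompt attention, so regions the non-class prompt
    ignores also count as foreground evidence. Maps are assumed in [0, 1]."""
    fused = attn_class + (1.0 - attn_nonclass)
    return fused / fused.max()   # renormalize to [0, 1]
```

In this sketch, driving `attention_refinement_loss` to zero makes the attended regions under attack match those of the clean original model, which is the stated alignment objective.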
Title: Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness
Pub Date: 2026-03-02 | DOI: 10.1109/tpami.2026.3664842
Fei Peng, Yang Liu, Guohui Zhou, Min Long
Face recognition models are vulnerable to spoofing by adversarial patches in the physical world. Attackers can cause face recognition models to make false identity judgments by simply pasting a sticker with a special pattern on the face. However, existing attacks lack transferability to black-box models, and improvements in transferability have mainly focused on adversarial perturbations based on the p-norm. To further improve attack performance and transferability, a highly transferable adversarial patch generation method for face recognition, named AdvDiffusion, is proposed. It first determines the region for adversarial patch generation based on facial gradient maps, and then an image is reconstructed to generate an adversarial patch by adding noise and denoising it with a pre-trained diffusion model. During denoising, an adversarial loss is used to fine-tune the model and steer the image toward an adversarial patch with spoofing capability. Experiments and analysis show that the adversarial patches generated by AdvDiffusion have good adversarial attack capability against black-box face recognition models in both digital and physical domains, and also have better robustness under changes in a complex physical environment compared with some state-of-the-art methods. It has great potential application for black-box attacks in the physical domain.
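The patch-attack setting can be sketched generically: optimize only the pixels inside a patch region while the rest of the face stays untouched. The snippet below uses a plain signed-gradient step as a stand-in; the facial-gradient region selection and the diffusion-model reconstruction that actually define AdvDiffusion are omitted, and the function names are hypothetical.

```python
import numpy as np

def paste_patch(image, patch, top, left):
    """Paste a patch into an image at (top, left); a crude stand-in for the
    mask-guided placement derived from facial gradient maps."""
    out = image.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return out

def pgd_patch_step(patch, grad, step=0.01):
    """One signed-gradient ascent step on the patch pixels only, clipped to a
    valid [0, 1] pixel range. grad is the adversarial-loss gradient w.r.t. the
    patch (its computation against a face model is not shown here)."""
    return np.clip(patch + step * np.sign(grad), 0.0, 1.0)
```

Iterating `pgd_patch_step` and re-pasting the patch is the generic loop; AdvDiffusion's contribution is replacing the raw pixel updates with diffusion-based reconstruction so the patch stays natural-looking and transfers better.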
Title: AdvDiffusion: Adversarial Patches Generation for Face Recognition with High Transferability in Physical Domain
Pub Date: 2026-02-25 | DOI: 10.1109/tpami.2026.3667935
Baoquan Zhang, Guotao Liang, Tianran Chen, Yunming Ye, Zhiyuan Wen, Xiaochen Qi, Yao He
Title: Codebook Transfer With Vision-to-Language Translation for Vector Quantization