
Latest Publications in IEEE Transactions on Pattern Analysis and Machine Intelligence

Causal Prompts for Open-vocabulary Video Instance Segmentation.
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-03-03 · DOI: 10.1109/tpami.2026.3669976
Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
Open-vocabulary Video Instance Segmentation addresses the challenging task of detecting, segmenting, and tracking objects in videos, including categories not encountered during training. However, existing approaches often overlook rich temporal cues from preceding frames, limiting their ability to leverage causal context for robust open-world generalization. To bridge this gap, we propose CPOVIS, a novel framework that introduces causal prompts (visual and taxonomy prompts dynamically propagated from historical frames) to enhance temporal reasoning and semantic consistency. Built upon a Mask2Former architecture with a CLIP backbone, CPOVIS integrates three core innovations: (1) PromptCLIP, which aligns cross-modal embeddings while preserving open-vocabulary capabilities; (2) a Visual Prompt Injector that propagates object-level features to maintain spatial-temporal coherence; and (3) a Taxonomy Prompt Infuser that leverages hierarchical semantic relationships to stabilize unseen category recognition. Furthermore, we introduce a contrastive learning strategy to disentangle object representations across frames and adapt the Segment Anything Model (SAM2) to boost segmentation and tracking capacity in open-vocabulary video scenarios. Extensive experiments on seven challenging open- and closed-vocabulary video segmentation benchmarks demonstrate CPOVIS's state-of-the-art performance, outperforming existing methods by significant margins. Our findings highlight the critical role of causal prompt propagation in advancing video understanding in open-world scenarios.
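As a rough illustration of the causal-prompt idea in the abstract above, the sketch below fuses prompts carried over from earlier frames into the current frame's instance queries via cross-attention. The module name `VisualPromptInjector` borrows the abstract's terminology, but the implementation is a minimal stand-in, not the authors' released code.

```python
# Minimal sketch of causal visual-prompt propagation (hypothetical stand-in,
# not the CPOVIS release): previous-frame object features act as prompts that
# condition the current frame's instance queries.
import torch
import torch.nn as nn

class VisualPromptInjector(nn.Module):
    """Fuses object-level features from earlier frames into current queries."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # queries: (B, Nq, D) current-frame instance queries
        # prompts: (B, Np, D) visual prompts propagated from history
        fused, _ = self.cross_attn(queries, prompts, prompts)
        return self.norm(queries + fused)

# Toy usage: carry prompts causally across a short clip.
B, Nq, D, T = 2, 20, 256, 4
injector = VisualPromptInjector(D)
prompts = torch.zeros(B, Nq, D)            # no history at t = 0
for t in range(T):
    queries = torch.randn(B, Nq, D)        # stand-in for per-frame decoder queries
    queries = injector(queries, prompts)
    prompts = queries.detach()             # propagate forward as causal prompts
```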
Citations: 0
Supervised Small-baseline and Large-baseline Homography Learning with Diffusion-based Data Generation.
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-03-03 · DOI: 10.1109/tpami.2026.3669995
Hai Jiang, Haipeng Li, Songchen Han, Bing Zeng, Shuaicheng Liu
In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data for supervised small-baseline and large-baseline homography learning and yield a state-of-the-art homography estimation network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content refinement diffusion model. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method outperforms existing competitors and previous supervised methods can also be improved based on the generated dataset. The code and dataset are available at https://github.com/JianghaiSCU/RealSH.
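To make the two-phase loop concrete, here is a schematic sketch under stated assumptions: `model.predict` and `model.train_on` are hypothetical stand-ins for the homography network's interface, and the content-refinement diffusion step is omitted; only the OpenCV warping calls are real API.

```python
# Schematic sketch of the generate-then-train iteration (helper methods are
# hypothetical, not the released RealSH interface).
import numpy as np
import cv2

def sample_homography(scale: float = 0.1) -> np.ndarray:
    """Sample a random ground-truth homography near identity."""
    H = np.eye(3) + scale * np.random.randn(3, 3)
    H[2, 2] = 1.0
    return H

def generate_pair(img_a, img_b, model):
    # Generation phase: align the real pair with the model's current estimate,
    # then apply a freshly sampled homography as the new label.
    H_est = model.predict(img_a, img_b)                   # hypothetical call
    aligned = cv2.warpPerspective(img_b, H_est, img_b.shape[1::-1])
    H_gt = sample_homography()
    warped = cv2.warpPerspective(aligned, H_gt, aligned.shape[1::-1])
    return (img_a, warped), H_gt                          # labeled training pair

def iterate(model, unlabeled_pairs, rounds: int = 3):
    for _ in range(rounds):
        data = [generate_pair(a, b, model) for a, b in unlabeled_pairs]
        model.train_on(data)                              # training phase (hypothetical)
    return model
```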
Citations: 0
Generalization Properties of Robust Learning With Random Features.
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-03-03 · DOI: 10.1109/tpami.2026.3669168
Caixing Wang
Random features (RFs) provide an efficient approximation to kernel methods, and allow for scalable learning on large datasets by reducing computational complexity while maintaining strong theoretical guarantees. However, real-world data can often be contaminated by outliers or heavy-tailed noise, which significantly degrades the performance of standard RF algorithms. To address this issue, we propose a robust and adaptive regularized least squares method with random features (RRLS-RF) that incorporates response truncation. The truncation level adaptively balances robustness and bias based on the sample size and moment conditions. We establish the generalization properties of RRLS-RF by assuming only a bounded $(1+\delta)$-th moment for any $\delta > 0$. Specifically, our analysis shows that RRLS-RF achieves learning rates of $\mathcal{O}(|D|^{-\frac{\delta}{2\delta+2}})$ with only $\mathcal{O}(|D|^{\frac{\delta}{2\delta+2}}\log |D|)$ random features, where $|D|$ denotes the training sample size. These results converge to the optimal learning rates of $\mathcal{O}(|D|^{-\frac{1}{2}})$ as $\delta \rightarrow \infty$, covering the traditional boundedness or sub-Gaussian assumptions in the regularized least squares method with random features (RLS-RF). Furthermore, we refine our analysis and show that RRLS-RF can achieve even faster learning rates under source and capacity conditions, as well as with a smaller number of RFs via data-dependent sampling strategies. The derived sharp learning rates also cover mis-specified settings where the true function may not precisely align with the assumed kernel space. We further establish the first minimax lower bound under the weak moment condition, which shows that the RRLS-RF estimator is optimal over a wide range of source conditions. Our numerical experiments and real data analysis verify the theoretical results and demonstrate the superior robustness of RRLS-RF against outliers and heavy-tailed noise compared to standard methods.
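The core estimator is easy to prototype. The sketch below runs truncated regularized least squares on random Fourier features under heavy-tailed noise; the truncation level `tau` is a hand-set illustrative rate, not the paper's adaptive, moment-dependent choice.

```python
# Toy RRLS-RF: ridge regression on random Fourier features with the responses
# clipped at a truncation level tau (illustrative choice, not the paper's).
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 2000, 5, 100, 1e-2        # samples, input dim, features, ridge

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + rng.standard_t(df=2, size=n)   # heavy-tailed noise

# Random Fourier features approximating a Gaussian kernel.
W = rng.normal(size=(d, m))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)
Phi = np.sqrt(2.0 / m) * np.cos(X @ W + b)

# Response truncation: clip |y| at tau, which grows with the sample size.
tau = n ** 0.25
y_trunc = np.clip(y, -tau, tau)

# Closed-form ridge solution on the truncated responses.
alpha = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(m), Phi.T @ y_trunc)
y_hat = Phi @ alpha
print("MSE vs clean signal:", np.mean((y_hat - np.sin(X[:, 0])) ** 2))
```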
Citations: 0
Semi-Supervised VQA Multi-Modal Explanation via Self-Critical Learning.
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-03-02 · DOI: 10.1109/tpami.2026.3669188
Wei Suo, Ji Ma, Mengyang Sun, Hanwang Zhang, Peng Wang, Yanning Zhang, Qi Wu
The VQA explanation task aims to explain the decision-making process of VQA models in a way that is easily understandable to humans. Existing methods mostly use visual localization or natural language explanation approaches to generate the corresponding rationales. Although significant progress has been made, these frameworks are bottlenecked by the following challenges: 1) The uni-modal paradigm inevitably leads to semantic ambiguity of explanations. 2) The reasoning process is not faithfully reflected and suffers from logical inconsistency. 3) Human-annotated explanations are expensive and time-consuming to collect. In this paper, we introduce a new Semi-supervised VQA Multi-modal Explanation (SME) method via self-critical learning, which addresses the above challenges by leveraging both visual and textual explanations to comprehensively reveal the inference process of the model. Meanwhile, to improve the logical consistency between answers and rationales, we design a novel self-critical strategy that evaluates candidate explanations based on answer reward scores. More importantly, with semi-supervised learning, our method can benefit from a tremendous number of samples that lack human-annotated explanations. Extensive automatic measures and human evaluations all show the effectiveness of our method. Finally, the framework achieves a new state-of-the-art performance on the three VQA explanation datasets. The code for this work is publicly available at https://github.com/Fake10086/MM-Explanations.
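The self-critical strategy can be sketched as a reward-baselined policy-gradient loss over sampled explanations. In the toy code below, `answer_reward` is a hypothetical stand-in that scores an explanation by the probability the answer head assigns to the gold answer, and the mean reward over candidates serves as the self-critical baseline.

```python
# Toy self-critical loss over K candidate explanations per example
# (shapes and the reward function are illustrative assumptions).
import torch

def answer_reward(cand_logits: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    # cand_logits: (B, K, A) answer logits conditioned on each explanation;
    # gold: (B, K) gold-answer indices. Reward = prob. of the gold answer.
    return cand_logits.softmax(-1).gather(-1, gold.unsqueeze(-1)).squeeze(-1)

def self_critical_loss(log_probs, cand_logits, gold):
    # log_probs: (B, K) log-likelihoods of K sampled explanations.
    gold_k = gold.unsqueeze(1).expand(-1, cand_logits.size(1))
    rewards = answer_reward(cand_logits, gold_k)          # (B, K)
    baseline = rewards.mean(dim=1, keepdim=True)          # self-critical baseline
    return -((rewards - baseline).detach() * log_probs).mean()

B, K, A = 4, 5, 100
loss = self_critical_loss(torch.randn(B, K), torch.randn(B, K, A),
                          torch.randint(0, A, (B,)))
print(loss)
```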
Citations: 0
UniFES: A Unified Recurrent Network for Quality Enhancement and Stabilization in Face Videos.
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-03-02 · DOI: 10.1109/tpami.2026.3669431
Tie Liu, Mai Xu, Shengxi Li, Jialu Zhang, Lai Jiang
Recent years have witnessed an explosive increase in face content, which drives a distinct shift from static images to dynamic video formats. The shift of formats inherently alters the characteristics within face videos, whereby pixel-wise artifacts are intertwined with motion-related impairments. Addressing these emerging distortions, which now typically appear in pairs in practice, is challenging and non-trivial due to the distinct characteristics of addressing spatial-temporal frequencies in videos. In this paper, we propose a novel Unified recurrent network for joint Face video quality Enhancement and Stabilization (UniFES), as the first successful attempt at both quality enhancement and motion stabilization. Correspondingly, our UniFES method effectively aggregates the mutual information in the pixel and motion domains. For quality enhancement, UniFES decomposes the shaky temporal alignment problem into progressive feature alignment with explicit physical information, which includes the global dynamics from the motion domain, i.e., from the stabilization task. Regarding video stabilization, we integrate the mixed dynamics from the enhancement task (i.e., from the pixel domain) to take into account both pixel-wise and motion-related characteristics, ensuring robust trajectory estimation and motion stabilization. Subsequently, we refine the warping masks to achieve high-quality full-frame rendering. We further establish a synthetic dataset for training and evaluation of this emerging task. Comprehensive experiments illustrate the superior performance of our UniFES method over 32 competing baselines on both newly established synthetic and real-world datasets.
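As a loose illustration of coupling the pixel and motion domains in one recurrent pass, the toy cell below mixes the previous frame's state into the current features and feeds a global motion cue back into the pixel branch. All module names and shapes are hypothetical simplifications, far smaller than the actual UniFES architecture.

```python
# Toy recurrent cell exchanging pixel-domain and motion-domain information
# (hypothetical simplification, not the UniFES implementation).
import torch
import torch.nn as nn

class JointCell(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        self.enhance = nn.Conv2d(ch * 2, ch, 3, padding=1)   # pixel branch
        self.motion = nn.Linear(ch, 4)                        # toy global dynamics
        self.fuse = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, feat: torch.Tensor, state: torch.Tensor):
        # feat, state: (B, C, H, W); state carries the previous frame's features.
        enhanced = torch.relu(self.enhance(torch.cat([feat, state], dim=1)))
        global_dyn = self.motion(enhanced.mean(dim=(2, 3)))   # motion-domain cue
        out = self.fuse(enhanced) + global_dyn.mean(1)[:, None, None, None]
        return out, out.detach()                              # next recurrent state

B, C, H, W = 1, 32, 64, 64
cell, state = JointCell(C), torch.zeros(B, C, H, W)
for frame in torch.randn(8, B, C, H, W):                      # 8-frame clip
    out, state = cell(frame, state)
```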
Citations: 0
Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness.
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-03-02 · DOI: 10.1109/tpami.2026.3669252
Lu Yu, Haiyang Zhang, Changsheng Xu
Due to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: a Local Attention Refinement Module and a Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness: the Local Attention Refinement Module aligns the text-guided attention obtained from the target model via adversarial examples with the text-guided attention acquired from the original model via clean examples. This alignment enhances the model's robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements, respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.
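The two attention losses can be written compactly. In the sketch below, `attention_map` is a hypothetical stand-in for extracting CLIP's text-guided attention over image patches; the local loss aligns the target model's attention on adversarial inputs with the original model's attention on clean inputs, while the global loss keeps the two models close on clean inputs.

```python
# Sketch of the local-refinement and global-constraint attention losses
# (attention extraction is a simplified stand-in, not CLIP internals).
import torch
import torch.nn.functional as F

def attention_map(image_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    # image_feats: (B, P, D) patch features; text_feat: (B, D) class embedding.
    attn = torch.einsum("bpd,bd->bp", image_feats, text_feat)
    return attn.softmax(dim=-1)

def tga_losses(adv_feats_tgt, clean_feats_tgt, clean_feats_orig, text_feat):
    # Local refinement: target-model attention on adversarial inputs should
    # match original-model attention on clean inputs.
    local = F.mse_loss(attention_map(adv_feats_tgt, text_feat),
                       attention_map(clean_feats_orig, text_feat))
    # Global constraint: keep target and original attention close on clean inputs.
    glob = F.mse_loss(attention_map(clean_feats_tgt, text_feat),
                      attention_map(clean_feats_orig, text_feat))
    return local, glob

B, P, D = 2, 49, 512
local, glob = tga_losses(torch.randn(B, P, D), torch.randn(B, P, D),
                         torch.randn(B, P, D), torch.randn(B, D))
```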
Citations: 0
AdvDiffusion: Adversarial Patches Generation for Face Recognition with High Transferability in Physical Domain.
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-03-02 · DOI: 10.1109/tpami.2026.3664842
Fei Peng, Yang Liu, Guohui Zhou, Min Long
Face recognition models are vulnerable to spoofing by adversarial patches in the physical world. Attackers can cause face recognition models to make false identity judgments by simply pasting a sticker with a special pattern on the face. However, existing attacks lack transferability to black-box models, and improvements in transferability have mainly focused on adversarial perturbations based on the p-norm. To further improve attack performance and transferability, a highly transferable face recognition adversarial patch generation method named AdvDiffusion is proposed. It first determines the region for adversarial patch generation based on facial gradient maps; an image is then reconstructed to generate an adversarial patch by adding noise and denoising it with a pre-trained diffusion model. During denoising, an adversarial loss is used to fine-tune the model and steer the image toward an adversarial patch with spoofing capability. Experiments and analysis show that the adversarial patches generated by AdvDiffusion have good adversarial attack capability against black-box face recognition models in both the digital and physical domains, and also have better robustness under changes in a complex physical environment compared with some state-of-the-art methods. It has great potential application for black-box attacks in the physical domain.
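A generic guided-denoising loop conveys the mechanism: denoise the patch region while nudging each step along the gradient of an adversarial identity loss. In the sketch below, `denoiser` and `face_embed` are dummy stand-ins, and the update rule is a simplified guided-diffusion step, not the paper's exact procedure.

```python
# Toy adversarially guided denoising for a facial patch region (all models
# are dummy stand-ins; the update rule is a generic sketch).
import torch

def denoiser(x: torch.Tensor, t: int) -> torch.Tensor:
    return 0.1 * x                       # stand-in for a pretrained noise predictor

def face_embed(img: torch.Tensor) -> torch.Tensor:
    return img.mean(dim=(2, 3))          # stand-in for a face-recognition backbone

def adv_patch(face, mask, target_emb, steps=50, guidance=0.5):
    x = torch.randn_like(face)           # start the patch region from noise
    for t in reversed(range(steps)):
        x = x.detach().requires_grad_(True)
        x0 = x - denoiser(x, t)          # toy posterior-mean estimate
        composite = face * (1 - mask) + x0 * mask
        # Adversarial loss: pull the composite toward the target identity.
        loss = 1 - torch.cosine_similarity(face_embed(composite), target_emb).mean()
        grad = torch.autograd.grad(loss, x)[0]
        x = (x0 - guidance * grad).detach()   # denoise, then nudge adversarially
    return face * (1 - mask) + x * mask

face = torch.rand(1, 3, 112, 112)
mask = torch.zeros_like(face)
mask[..., 70:100, 30:80] = 1.0           # sticker region on the lower face
patched = adv_patch(face, mask, target_emb=torch.rand(1, 3))
```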
Citations: 0
Codebook Transfer With Vision-to-Language Translation for Vector Quantization
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-02-25 · DOI: 10.1109/tpami.2026.3667935
Baoquan Zhang, Guotao Liang, Tianran Chen, Yunming Ye, Zhiyuan Wen, Xiaochen Qi, Yao He
{"title":"Codebook Transfer With Vision-to-Language Translation for Vector Quantization","authors":"Baoquan Zhang, Guotao Liang, Tianran Chen, Yunming Ye, Zhiyuan Wen, Xiaochen Qi, Yao He","doi":"10.1109/tpami.2026.3667935","DOIUrl":"https://doi.org/10.1109/tpami.2026.3667935","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"17 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147287603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-02-25 · DOI: 10.1109/tpami.2026.3668075
Casey Wall, Longwei Wang, Rodrigue Rizk, KC Santosh
{"title":"Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization","authors":"Casey Wall, Longwei Wang, Rodrigue Rizk, KC Santosh","doi":"10.1109/tpami.2026.3668075","DOIUrl":"https://doi.org/10.1109/tpami.2026.3668075","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"19 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147287582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Deployment Prior Injection for Run-time Re-biasable Object Detection
IF 23.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-02-25 · DOI: 10.1109/tpami.2026.3667914
Mo Zhou, Yiding Yang, Haoxiang Li, Vishal M. Patel, Gang Hua
{"title":"Deployment Prior Injection for Run-time Re-biasable Object Detection","authors":"Mo Zhou, Yiding Yang, Haoxiang Li, Vishal M. Patel, Gang Hua","doi":"10.1109/tpami.2026.3667914","DOIUrl":"https://doi.org/10.1109/tpami.2026.3667914","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"19 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147287581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0