
Latest Publications: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Imitative Non-Autoregressive Modeling for Trajectory Forecasting and Imputation
Pub Date: 2020-06-01 | DOI: 10.1109/CVPR42600.2020.01275 | Pages: 12733-12742
Mengshi Qi, Jie Qin, Yu Wu, Yi Yang
Trajectory forecasting and imputation are pivotal steps towards understanding the movement of humans and objects. Both are quite challenging, since the future trajectories and missing values in a temporal sequence are full of uncertainties, and the spatio-temporal contextual correlation is hard to model. Yet, the relevance between sequence prediction and imputation is disregarded by existing approaches. To this end, we propose a novel imitative non-autoregressive modeling method to simultaneously handle the trajectory prediction task and the missing-value imputation task. Specifically, our framework adopts an imitation learning paradigm, which contains a recurrent conditional variational autoencoder (RC-VAE) as a demonstrator and a non-autoregressive transformation model (NART) as a learner. By jointly optimizing the two models, the RC-VAE can predict the future trajectory and capture the temporal relationships in the sequence to supervise the NART learner. As a result, the NART learns from the demonstrator and imputes the missing values in a non-autoregressive manner. We conduct extensive experiments on three popular datasets, and the results show that our model achieves state-of-the-art performance across all the datasets.
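To make the demonstrator/learner split concrete, here is a minimal PyTorch sketch in which a recurrent teacher (standing in for the RC-VAE, with the variational machinery omitted) rolls forecasts out step by step, while the learner emits all future steps in one pass and imitates the teacher. The module names, 2-D trajectories, 8-step horizon, and plain MSE imitation loss are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Demonstrator(nn.Module):
    """Recurrent stand-in for the RC-VAE: rolls the forecast out step by step."""
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, past, steps=8):
        _, h = self.rnn(past)                # encode the observed trajectory
        x, outs = past[:, -1:, :], []
        for _ in range(steps):               # autoregressive rollout
            o, h = self.rnn(x, h)
            x = self.head(o)
            outs.append(x)
        return torch.cat(outs, dim=1)        # (B, steps, 2)

class NARLearner(nn.Module):
    """Non-autoregressive learner: emits every future step in one pass."""
    def __init__(self, t_obs=8, steps=8, hidden=128):
        super().__init__()
        self.steps = steps
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.Linear(t_obs * 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, steps * 2))

    def forward(self, past):
        return self.net(past).view(-1, self.steps, 2)

past, future = torch.randn(4, 8, 2), torch.randn(4, 8, 2)
teacher, student = Demonstrator(), NARLearner()
# the learner imitates the temporally informed demonstrator and fits the data
loss = nn.functional.mse_loss(student(past), teacher(past).detach()) \
     + nn.functional.mse_loss(student(past), future)
loss.backward()
```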
Citations: 28
WCP: Worst-Case Perturbations for Semi-Supervised Deep Learning
Pub Date: 2020-06-01 | DOI: 10.1109/CVPR42600.2020.00397 | Pages: 3911-3920
Liheng Zhang, Guo-Jun Qi
In this paper, we present a novel regularization mechanism for training deep networks by minimizing the Worst-Case Perturbation (WCP). It is based on the idea that a robust model is least likely to be affected by small perturbations, such that its output decisions should be as stable as possible on both labeled and unlabeled examples. We consider two forms of WCP regularization -- additive and DropConnect perturbations, which impose additive noises on network weights and make structural changes by dropping network connections, respectively. We show that the worst cases of both perturbations can be derived by solving the respective optimization problems with spectral methods. The WCP can be minimized on both labeled and unlabeled data so that networks can be trained in a semi-supervised fashion. This leads to a novel paradigm of semi-supervised classifiers that stabilizes the predicted outputs in the presence of worst-case perturbations imposed on the network weights and structures.
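The additive form of WCP can be approximated without the spectral derivation by a single power-iteration step, in the spirit of virtual adversarial training: probe the weights with a random perturbation, extract the direction that changes the predictions most, then penalize the prediction change under that direction. The sketch below shows only this approximation; the toy model and the constants xi and eps are placeholders, and torch.func (PyTorch 2.0+) is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
x = torch.randn(16, 10)                       # an unlabeled batch
clean = model(x).softmax(dim=1).detach()

# one power-iteration step to approximate the worst-case weight direction
params = dict(model.named_parameters())
r = {k: torch.randn_like(v, requires_grad=True) for k, v in params.items()}
xi = 1e-3
out = functional_call(model, {k: params[k] + xi * r[k] for k in params}, (x,))
kl = F.kl_div(out.log_softmax(dim=1), clean, reduction="batchmean")
grads = torch.autograd.grad(kl, list(r.values()))
d = {k: g / (g.norm() + 1e-12) for k, g in zip(r, grads)}

# consistency loss under the (approximate) worst-case additive perturbation
eps = 1e-2
out_adv = functional_call(model, {k: params[k] + eps * d[k] for k in params}, (x,))
wcp_loss = F.kl_div(out_adv.log_softmax(dim=1), clean, reduction="batchmean")
wcp_loss.backward()   # minimize alongside the supervised loss on labeled data
```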
Citations: 33
Deep Kinematics Analysis for Monocular 3D Human Pose Estimation
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00098 | Pages: 896-905
Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, Wenjun Zhang
For monocular 3D pose estimation conditioned on 2D detection, noisy/unreliable input is a key obstacle. Simple structural constraints attempting to tackle this problem, e.g., symmetry loss and joint-angle limits, provide only marginal improvements and are commonly treated as auxiliary losses in previous research. Thus it remains challenging to effectively utilize the power of human prior knowledge for this task. In this paper, we propose to address the above issue from a systematic view. Firstly, we show that optimizing the kinematic structure of noisy 2D inputs is critical to obtaining accurate 3D estimations. Secondly, based on the corrected 2D joints, we further explicitly decompose articulated motion with human topology, which leads to a more compact 3D static structure that is easier to estimate. Finally, temporal refinement emphasizing the validity of the 3D dynamic structure is naturally developed to pursue more accurate results. The above three steps are seamlessly integrated into deep neural models, forming a deep kinematics analysis pipeline that concurrently considers the static/dynamic structure of 2D inputs and 3D outputs. Extensive experiments show that the proposed framework achieves state-of-the-art performance on two widely used 3D human action datasets. Meanwhile, a targeted ablation study shows that each former step is critical for the latter one to obtain promising results.
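The decomposition step rests on forward kinematics: with bone lengths fixed by the human topology, relative joint rotations fully determine the 3D joint positions, which is what makes the static structure compact. The toy chain below illustrates that mapping; the four-joint topology, bone lengths, and z-axis-only rotations are invented for illustration and are not the paper's skeleton model.

```python
import torch

parents = [-1, 0, 1, 2]                      # each joint's parent in the chain
bone_len = [0.0, 0.5, 0.4, 0.3]              # fixed bone lengths

def rot_z(theta):
    """3x3 rotation about the z-axis for a scalar angle tensor."""
    c, s = torch.cos(theta), torch.sin(theta)
    z, o = torch.zeros(()), torch.ones(())
    return torch.stack([torch.stack([c, -s, z]),
                        torch.stack([s, c, z]),
                        torch.stack([z, z, o])])

def forward_kinematics(angles):
    """Relative joint angles -> 3-D joint positions along the chain."""
    pos, rots = [torch.zeros(3)], [torch.eye(3)]
    for j in range(1, len(parents)):
        R = rots[parents[j]] @ rot_z(angles[j])       # accumulate rotations
        offset = R @ torch.tensor([bone_len[j], 0.0, 0.0])
        pos.append(pos[parents[j]] + offset)          # walk down the chain
        rots.append(R)
    return torch.stack(pos)                           # (4, 3) joint positions

joints3d = forward_kinematics(torch.tensor([0.0, 0.3, -0.2, 0.5]))
print(joints3d)
```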
Citations: 120
Structure Boundary Preserving Segmentation for Medical Image With Ambiguous Boundary
Pub Date: 2020-06-01 | DOI: 10.1109/CVPR42600.2020.00487 | Pages: 4816-4825
Hong Joo Lee, Jung Uk Kim, Sangmin Lee, Hak Gu Kim, Yong Man Ro
In this paper, we propose a novel image segmentation method to tackle two critical problems of medical images: (i) the ambiguity of structure boundaries in the medical image domain and (ii) the uncertainty of the segmented region without specialized domain knowledge. To solve these two problems in automatic medical segmentation, we propose a novel structure-boundary-preserving segmentation framework. To this end, a boundary key point selection algorithm is proposed, which estimates the key points on the structural boundary of the target object. Then, a boundary preserving block (BPB) with the boundary key point map is applied to predict the structure boundary of the target object. Further, to embed experts' knowledge in the fully automatic segmentation, we propose a novel shape boundary-aware evaluator (SBE) that uses the ground-truth structure information indicated by experts. The proposed SBE gives feedback to the segmentation network based on the structure boundary key points. The proposed method is general and flexible enough to be built on top of any deep-learning-based segmentation network. We demonstrate that the proposed method surpasses state-of-the-art segmentation networks and improves the accuracy of three different segmentation network models on different types of medical image datasets.
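As a rough illustration of the boundary signal involved, the sketch below derives a one-pixel-wide structure boundary from a binary mask with a morphological gradient and keeps k locations as a key-point map that could be fused with encoder features, roughly what a BPB consumes. The toy mask, k, and the naive top-k selection are assumptions; the paper's key point selection algorithm is more deliberate.

```python
import torch
import torch.nn.functional as F

mask = torch.zeros(1, 1, 64, 64)
mask[..., 20:44, 16:48] = 1.0                 # toy ground-truth organ mask

# morphological gradient: dilation minus erosion leaves the structure boundary
dilated = F.max_pool2d(mask, 3, stride=1, padding=1)
eroded = -F.max_pool2d(-mask, 3, stride=1, padding=1)
boundary = dilated - eroded

# keep k boundary locations as the key point map
k = 16
idx = boundary.flatten().topk(k).indices
keypoint_map = torch.zeros(64 * 64).scatter(0, idx, 1.0).view(1, 1, 64, 64)

# the map can then be fused with encoder features, as a BPB-style block would
feats = torch.randn(1, 32, 64, 64)
fused = torch.cat([feats, keypoint_map], dim=1)   # (1, 33, 64, 64)
```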
Citations: 65
A Shared Multi-Attention Framework for Multi-Label Zero-Shot Learning
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00880 | Pages: 8773-8783
Dat T. Huynh, Ehsan Elhamifar
In this work, we develop a shared multi-attention model for multi-label zero-shot learning. We argue that designing an attention mechanism for recognizing multiple seen and unseen labels in an image is a non-trivial task, as there is no training signal to localize unseen labels, and an image contains only a few present labels that need attention out of thousands of possible labels. Therefore, instead of generating attentions for unseen labels, which have unknown behaviors and could focus on irrelevant regions due to the lack of any training sample, we let the unseen labels select among a set of shared attentions that are trained, through our novel loss, to be label-agnostic and to focus only on relevant/foreground regions. Finally, we learn a compatibility function to distinguish labels based on the selected attention. We further propose a novel loss function that consists of three components guiding the attention to focus on diverse and relevant image regions while utilizing all attention features. Through extensive experiments, we show that our method improves the state of the art by 2.9% and 1.4% F1 score on the NUS-WIDE and the large-scale Open Images datasets, respectively.
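A minimal sketch of the shared-attention mechanism follows: M label-agnostic heads each pool the region features into one vector, every label scores all M pooled features against its semantic embedding, and the best-scoring head is kept. All shapes, the linear attention, and the dot-product compatibility are simplifying assumptions.

```python
import torch
import torch.nn as nn

B, R, D, M, C = 2, 49, 512, 4, 100      # batch, regions, feat dim, heads, labels
regions = torch.randn(B, R, D)          # CNN region features of an image
attn = nn.Linear(D, M)                  # M shared, label-agnostic attention heads
label_emb = torch.randn(C, D)           # semantic label vectors (e.g. word embeddings)

weights = attn(regions).softmax(dim=1)                      # (B, R, M) region weights
attended = torch.einsum("brm,brd->bmd", weights, regions)   # one pooled feature per head
scores = torch.einsum("bmd,cd->bmc", attended, label_emb)   # head-label compatibility
logits = scores.max(dim=1).values       # each label selects its best shared head
```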
Citations: 64
A Multi-Task Mean Teacher for Semi-Supervised Shadow Detection
Pub Date: 2020-06-01 | DOI: 10.1109/CVPR42600.2020.00565 | Pages: 5610-5619
Zhihao Chen, Lei Zhu, Liang Wan, Song Wang, Wei Feng, P. Heng
Existing shadow detection methods suffer from an intrinsic limitation in relying on limited labeled datasets, and they may produce poor results in some complicated situations. To boost shadow detection performance, this paper presents a multi-task mean teacher model for semi-supervised shadow detection that leverages unlabeled data and explores the learning of multiple types of shadow information simultaneously. To be specific, we first build a multi-task baseline model to simultaneously detect shadow regions, shadow edges, and shadow count by leveraging their complementary information, and assign this baseline model to the student and teacher networks. After that, we encourage the predictions of the three tasks from the student and teacher networks to be consistent, computing a consistency loss on unlabeled data, which is then added to the supervised loss on the labeled data from the predictions of the multi-task baseline model. Experimental results on three widely used benchmark datasets show that our method consistently outperforms all the compared state-of-the-art methods, which verifies that the proposed network can effectively leverage additional unlabeled data to boost shadow detection performance.
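The consistency machinery is the standard mean-teacher recipe: the teacher is an exponential moving average (EMA) of the student, and their predictions on unlabeled images are pushed to agree. The sketch below shows one task (shadow regions) with a toy network; in the paper the loss spans regions, edges, and count, and is added to the supervised term.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# tiny stand-in for the multi-task network; only the region task is shown
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 1, 3, padding=1))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(alpha=0.99):
    """Teacher weights trail the student as an exponential moving average."""
    with torch.no_grad():
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(alpha).add_(sp, alpha=1 - alpha)

x_unlabeled = torch.randn(4, 3, 64, 64)
with torch.no_grad():
    t_region = teacher(x_unlabeled).sigmoid()     # teacher's shadow-region map
s_region = student(x_unlabeled).sigmoid()
consistency = F.mse_loss(s_region, t_region)      # agree on unlabeled data
consistency.backward()        # in practice, added to the supervised loss
ema_update()
```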
Citations: 83
Attack to Explain Deep Representation
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00956 | Pages: 9540-9549
M. Jalwana, Naveed Akhtar, Bennamoun, A. Mian
Deep visual models are susceptible to extremely low-magnitude perturbations to input images. Though carefully crafted, the perturbation patterns generally appear noisy, yet they are able to perform controlled manipulation of model predictions. This observation has been used to argue that deep representation is misaligned with human perception. This paper counter-argues and proposes the first attack on deep learning that aims at explaining the learned representation instead of fooling it. By extending the input domain of the manipulative signal and employing model-faithful channelling, we iteratively accumulate adversarial perturbations for a deep model. The accumulated signal gradually manifests itself as a collection of visually salient features of the target label (in model fooling), casting adversarial perturbations as primitive features of the target label. Our attack provides the first demonstration of systematically computing perturbations for adversarially non-robust classifiers that comprise salient visual features of objects. We leverage the model-explaining character of our algorithm to perform image generation, inpainting, and interactive image manipulation by attacking adversarially robust classifiers. The visually appealing results across these applications demonstrate the utility of our attack (and perturbations in general) beyond model fooling.
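At its core, accumulating perturbations is iterated gradient ascent on the target label's logit, as in the PGD family of attacks. The sketch below shows that bare loop on a randomly initialized stand-in classifier; the real method operates on trained (and, for the applications, adversarially robust) models and adds the paper's input-domain extension and model-faithful channelling.

```python
import torch
import torch.nn as nn

# stand-in classifier; the paper attacks real (robust) ImageNet models
model = nn.Sequential(nn.Conv2d(3, 8, 5, stride=4), nn.ReLU(),
                      nn.Flatten(), nn.LazyLinear(10))
x = torch.rand(1, 3, 64, 64)
target = 7                                  # label whose features we want to surface
delta = torch.zeros_like(x, requires_grad=True)

for _ in range(20):                         # iteratively accumulate the perturbation
    loss = model(x + delta)[0, target]      # push the prediction toward the target
    loss.backward()
    with torch.no_grad():
        delta += 1e-2 * delta.grad.sign()
        delta.clamp_(-0.1, 0.1)             # keep the signal low-magnitude
        delta.grad.zero_()
# delta now aggregates structure the model associates with the target label
```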
Citations: 12
ActBERT: Learning Global-Local Video-Text Representations
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00877 | Pages: 8743-8752
Linchao Zhu, Yi Yang
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information. This enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state of the art, demonstrating its superiority in video-text representation learning.
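Schematically, the three sources can be embedded, tagged with a type embedding, and fed to one transformer so that attention mixes actions, regions, and words, as sketched below. Note this plain concatenation is only a stand-in: the actual ENT block entangles the streams with a dedicated cross-stream design, and all dimensions here are assumptions.

```python
import torch
import torch.nn as nn

D = 256
actions = torch.randn(2, 4, D)          # global action features
regions = torch.randn(2, 10, D)         # local regional object features
words = torch.randn(2, 16, D)           # linguistic token embeddings

type_emb = nn.Embedding(3, D)           # tell the three sources apart
types = torch.cat([torch.full((4,), 0, dtype=torch.long),
                   torch.full((10,), 1, dtype=torch.long),
                   torch.full((16,), 2, dtype=torch.long)])
tokens = torch.cat([actions, regions, words], dim=1) + type_emb(types)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2)
joint = encoder(tokens)                 # (2, 30, D) joint video-text features
```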
Citations: 337
Learning Integral Objects With Intra-Class Discriminator for Weakly-Supervised Semantic Segmentation
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00434 | Pages: 4282-4291
Junsong Fan, Zhaoxiang Zhang, Chunfeng Song, T. Tan
Image-level weakly-supervised semantic segmentation (WSSS) aims at learning semantic segmentation by adopting only image class labels. Existing approaches generally rely on class activation maps (CAM) to generate pseudo-masks and then train segmentation models. The main difficulty is that the CAM estimate covers only parts of the foreground objects. In this paper, we argue that the critical factor preventing the recovery of the full object mask is the classification boundary mismatch problem in applying the CAM to WSSS. Because the CAM is optimized for the classification task, it focuses on the discrimination across different image-level classes. However, WSSS requires distinguishing pixels sharing the same image-level class to separate them into the foreground and the background. To alleviate this contradiction, we propose an efficient end-to-end Intra-Class Discriminator (ICD) framework, which learns intra-class boundaries to help separate the foreground and the background within each image-level class. Without bells and whistles, our approach achieves the state-of-the-art performance of image-label-based WSSS, with 68.0% mIoU on the VOC 2012 semantic segmentation benchmark, demonstrating the effectiveness of the proposed approach.
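A skeletal version of the intra-class scoring: one discriminator direction per class scores every pixel of the images that carry that class, and the sign of the score splits the pixels into foreground and background to form a pseudo-mask. Feature shapes, the linear discriminator, and the zero threshold are simplifications; the paper's training signal for learning these boundaries is not shown.

```python
import torch
import torch.nn as nn

C, D = 20, 128
feats = torch.randn(4, D, 32, 32)           # backbone pixel features
img_labels = torch.zeros(4, C)
img_labels[:, 3] = 1                        # toy image-level labels
icd = nn.Parameter(torch.randn(C, D))       # one discriminator direction per class

scores = torch.einsum("cd,bdhw->bchw", icd, feats)       # intra-class fg/bg score
fg = (scores > 0) & img_labels.bool()[:, :, None, None]  # pixels above the boundary
# fg serves as a pseudo-mask covering the integral object for training
```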
Citations: 145
Hyperbolic Visual Embedding Learning for Zero-Shot Recognition
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00929 | Pages: 9270-9278
Shaoteng Liu, Jingjing Chen, Liangming Pan, C. Ngo, Tat-seng Chua, Yu-Gang Jiang
This paper proposes a Hyperbolic Visual Embedding Learning Network for zero-shot recognition. The network learns image embeddings in hyperbolic space, which is capable of preserving the hierarchical structure of semantic classes in low dimensions. Compared with existing zero-shot learning approaches, the network is more robust because the embedding feature in hyperbolic space better represents the class hierarchy and thereby avoids the misleading results caused by unrelated siblings. Our network outperforms existing baselines under hierarchical evaluation with an extremely challenging setting, i.e., learning only from 1,000 categories to recognize 20,841 unseen categories. Under flat evaluation, it performs competitively with state-of-the-art methods while using five times lower embedding dimensions. Our code is publicly available at https://github.com/ShaoTengLiu/Hyperbolic_ZSL.
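The geometry underlying the method is the Poincaré ball, where distances grow rapidly near the boundary and tree-like hierarchies embed with low distortion; zero-shot classification then amounts to finding the label embedding nearest to the image embedding. Below is the standard Poincaré distance with a nearest-label lookup; the 10-dimensional random embeddings are placeholders.

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance between points inside the unit Poincare ball."""
    uu = u.pow(2).sum(-1).clamp(max=1 - eps)
    vv = v.pow(2).sum(-1).clamp(max=1 - eps)
    duv = (u - v).pow(2).sum(-1)
    x = 1 + 2 * duv / ((1 - uu) * (1 - vv))
    return torch.acosh(x.clamp(min=1 + eps))

img = torch.rand(2, 10) * 0.1             # image embeddings, well inside the ball
labels = torch.rand(100, 10) * 0.1        # class embeddings
dist = poincare_distance(img[:, None, :], labels[None, :, :])   # (2, 100)
pred = dist.argmin(dim=1)                 # nearest class in hyperbolic space
```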
Citations: 93