
Latest Publications: 2019 IEEE/CVF International Conference on Computer Vision (ICCV)

CamNet: Coarse-to-Fine Retrieval for Camera Re-Localization
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00296
Mingyu Ding, Zhe Wang, Jiankai Sun, Jianping Shi, P. Luo
Camera re-localization is an important but challenging task in applications like robotics and autonomous driving. Recently, retrieval-based methods have been considered a promising direction as they can be easily generalized to novel scenes. Although significant progress has been made, we observe that the performance bottleneck of previous methods actually lies in the retrieval module. These methods use the same features for both the retrieval and relative pose regression tasks, which creates potential conflicts in learning. To this end, we present a coarse-to-fine retrieval-based deep learning framework, which includes three steps, i.e., image-based coarse retrieval, pose-based fine retrieval, and precise relative pose regression. With our carefully designed retrieval module, the relative pose regression task becomes surprisingly simple. We design novel retrieval losses with a batch-hard sampling criterion and two-stage retrieval to locate samples suited to the relative pose regression task. Extensive experiments show that our model (CamNet) outperforms the state-of-the-art methods by a large margin on both indoor and outdoor datasets.
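The abstract does not spell out the retrieval losses; as a rough, hedged illustration of the batch-hard idea, the sketch below implements a generic batch-hard triplet-style loss in PyTorch in which positives and negatives are chosen by a pose-distance threshold. The function name, threshold, and margin are assumptions, not CamNet's actual formulation.

```python
import torch

def batch_hard_retrieval_loss(embeddings, pose_dists, pos_thresh=0.5, margin=0.2):
    """embeddings: (B, D) image descriptors; pose_dists: (B, B) pairwise pose distances."""
    B = embeddings.size(0)
    feat_dists = torch.cdist(embeddings, embeddings)                  # (B, B) descriptor distances
    eye = torch.eye(B, dtype=torch.bool, device=embeddings.device)
    pos_mask = (pose_dists < pos_thresh) & ~eye                       # pose neighbours count as positives
    neg_mask = ~pos_mask & ~eye
    # hardest positive: the pose neighbour that is farthest in feature space
    hardest_pos = feat_dists.masked_fill(~pos_mask, 0.0).max(dim=1).values
    # hardest negative: the non-neighbour that is closest in feature space
    hardest_neg = feat_dists.masked_fill(~neg_mask, float('inf')).min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```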
Citations: 101
AWSD: Adaptive Weighted Spatiotemporal Distillation for Video Representation
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00811
M. Tavakolian, H. R. Tavakoli, A. Hadid
We propose an Adaptive Weighted Spatiotemporal Distillation (AWSD) technique for video representation by encoding the appearance and dynamics of the videos into a single RGB image map. This is obtained by adaptively dividing the videos into small segments and comparing two consecutive segments. This allows using pre-trained models on still images for video classification while successfully capturing the spatiotemporal variations in the videos. The adaptive segment selection enables effective encoding of the essential discriminative information of untrimmed videos. Based on a Gaussian Scale Mixture, we compute the weights by extracting the mutual information between two consecutive segments. Unlike pooling-based methods, our AWSD gives more importance to the frames that characterize actions or events, thanks to its adaptive segment length selection. We conducted extensive experimental analysis to evaluate the effectiveness of our proposed method and compared our results against those of recent state-of-the-art methods on four benchmark datasets, including UCF101, HMDB51, ActivityNet v1.3, and Maryland. The results obtained on these benchmark datasets show that our method significantly outperforms earlier works and sets a new state-of-the-art performance in video classification. Code is available at the project webpage: https://mohammadt68.github.io/AWSD/
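As a hedged illustration of collapsing a clip into one weighted map, the sketch below splits a clip into segments and weights each segment by how much it differs from its predecessor; the paper's Gaussian-Scale-Mixture mutual-information weighting and adaptive segmentation are replaced here by simple proxies, and all names and defaults are assumptions.

```python
import numpy as np

def weighted_temporal_map(frames, num_segments=8):
    """frames: (T, H, W, 3) float array in [0, 1]; returns a single (H, W, 3) map."""
    segments = np.array_split(frames, num_segments)                 # adaptive splitting simplified to equal chunks
    seg_means = [seg.mean(axis=0) for seg in segments]              # one appearance map per segment
    # weight each segment by how much it changes w.r.t. the previous one
    # (a crude stand-in for the mutual-information weights used in AWSD)
    changes = [1.0] + [np.abs(cur - prev).mean() for prev, cur in zip(seg_means, seg_means[1:])]
    weights = np.asarray(changes) / (np.sum(changes) + 1e-8)
    return sum(w * m for w, m in zip(weights, seg_means))
```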
Citations: 6
Self-Supervised Difference Detection for Weakly-Supervised Semantic Segmentation
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00531
Wataru Shimoda, Keiji Yanai
To minimize the annotation costs associated with the training of semantic segmentation models, researchers have extensively investigated weakly-supervised segmentation approaches. In current weakly-supervised segmentation methods, the most widely adopted approach is based on visualization. However, the visualization results are not generally equal to semantic segmentation. Therefore, to perform accurate semantic segmentation under the weakly-supervised condition, it is necessary to consider mapping functions that convert the visualization results into semantic segmentation. For such mapping functions, the conditional random field and iterative re-training using the outputs of a segmentation model are usually used. However, these methods do not always guarantee improvements in accuracy; therefore, if we apply these mapping functions iteratively multiple times, the accuracy will eventually stop improving or will decrease. In this paper, to make the most of such mapping functions, we assume that the results of the mapping function include noise, and we improve the accuracy by removing this noise. To achieve our aim, we propose a self-supervised difference detection module, which estimates noise from the results of the mapping functions by predicting the difference between the segmentation masks before and after the mapping. We verified the effectiveness of the proposed method by performing experiments on the PASCAL Visual Object Classes 2012 dataset, achieving 64.9% on the val set and 65.5% on the test set. Both results set a new state of the art under the same weakly-supervised semantic segmentation setting.
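A minimal sketch of the difference-detection idea, under the assumption that a difference-detection network outputs a per-pixel change probability `pred_diff` for the mapping (e.g., CRF refinement): changes the detector anticipates are treated as reliable, unanticipated changes as noise. This is an illustrative reading, not the paper's exact module.

```python
import torch

def difference_target(mask_before, mask_after):
    """Binary map of pixels whose label changed after the mapping (e.g. CRF refinement)."""
    return (mask_before != mask_after).float()

def confidence_map(pred_diff, mask_before, mask_after):
    """Agreement between the predicted and the observed change: high where the
    difference detector anticipated the change, low where the change looks like noise."""
    observed = difference_target(mask_before, mask_after)
    return 1.0 - torch.abs(pred_diff - observed)

def fuse_masks(mask_before, mask_after, pred_diff, thresh=0.5):
    """Keep the refined label only at pixels where the confidence is high enough."""
    conf = confidence_map(pred_diff, mask_before, mask_after)
    return torch.where(conf >= thresh, mask_after, mask_before)
```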
Citations: 116
Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00901
Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, Yunde Jia
Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.
Citations: 71
Deep Multi-Model Fusion for Single-Image Dehazing
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00254
Zijun Deng, Lei Zhu, Xiaowei Hu, Chi-Wing Fu, Xuemiao Xu, Qing Zhang, J. Qin, P. Heng
This paper presents a deep multi-model fusion network to attentively integrate multiple models to separate layers and boost the performance in single-image dehazing. To do so, we first formulate the attentional feature integration module to maximize the integration of the convolutional neural network (CNN) features at different CNN layers and generate the attentional multi-level integrated features (AMLIF). Then, from the AMLIF, we further predict a haze-free result for an atmospheric scattering model, as well as for four haze-layer separation models, and then fuse the results together to produce the final haze-free image. To evaluate the effectiveness of our method, we compare our network with several state-of-the-art methods on two widely-used dehazing benchmark datasets, as well as on two sets of real-world hazy images. Experimental results demonstrate clear quantitative and qualitative improvements of our method over the state of the art.
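The atmospheric scattering model mentioned in the abstract is commonly written as I(x) = J(x)t(x) + A(1 − t(x)). The sketch below simply inverts it to recover a haze-free image once a transmission map and airlight have been predicted (the prediction itself is the network's job and is not shown); the variable names and the t_min guard are assumptions.

```python
import numpy as np

def recover_haze_free(hazy, transmission, airlight, t_min=0.1):
    """hazy: (H, W, 3) image in [0, 1]; transmission: (H, W); airlight: (3,) global atmospheric light.
    Inverts I(x) = J(x) * t(x) + A * (1 - t(x)); t_min guards against division by near-zero t."""
    t = np.clip(transmission, t_min, 1.0)[..., None]   # (H, W, 1) for broadcasting over channels
    dehazed = (hazy - airlight) / t + airlight
    return np.clip(dehazed, 0.0, 1.0)
```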
Citations: 79
Adversarial Defense via Learning to Generate Diverse Attacks
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00283
Y. Jang, Tianchen Zhao, Seunghoon Hong, Honglak Lee
With the remarkable success of deep learning, Deep Neural Networks (DNNs) have been applied as dominant tools in various machine learning domains. Despite this success, however, it has been found that DNNs are surprisingly vulnerable to malicious attacks; adding small, perceptually indistinguishable perturbations to the data can easily degrade classification performance. Adversarial training is an effective defense strategy to train a robust classifier. In this work, we propose to utilize a generator to learn how to create adversarial examples. Unlike existing approaches that create a one-shot perturbation with a deterministic generator, we propose a recursive and stochastic generator that produces much stronger and more diverse perturbations that comprehensively reveal the vulnerability of the target classifier. Our experimental results on the MNIST and CIFAR-10 datasets show that the classifier adversarially trained with our method yields more robust performance over various white-box and black-box attacks.
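A hedged, single-step sketch of generator-based adversarial training (the paper's generator is recursive and stochastic; here a one-shot generator with a noise input stands in for it). The `generator(x, z)` interface, the eps bound, and the optimizer handling are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(classifier, generator, opt_c, opt_g, x, y, eps=8 / 255):
    """One alternating update; generator(x, z) is assumed to output an unbounded map
    of the same shape as x, squashed to [-eps, eps] below."""
    z = torch.randn(x.size(0), 64, device=x.device)       # stochastic seed for diverse attacks
    delta = eps * torch.tanh(generator(x, z))             # bounded perturbation
    x_adv = torch.clamp(x + delta, 0.0, 1.0)

    # generator step: maximise the classifier's loss on perturbed inputs
    loss_g = -F.cross_entropy(classifier(x_adv), y)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # classifier step: minimise the loss on freshly generated perturbations
    with torch.no_grad():
        x_adv = torch.clamp(x + eps * torch.tanh(generator(x, z)), 0.0, 1.0)
    loss_c = F.cross_entropy(classifier(x_adv), y)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    return loss_c.item()
```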
Citations: 72
Attention-Aware Polarity Sensitive Embedding for Affective Image Retrieval
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00123
Xingxu Yao, Dongyu She, Sicheng Zhao, Jie Liang, Yu-Kun Lai, Jufeng Yang
Images play a crucial role in how people express their opinions online due to the increasing popularity of social networks. While an affective image retrieval system is useful for obtaining visual content with desired emotions from a massive repository, the abstract and subjective characteristics of emotion make the task challenging. To address the problem, this paper introduces an Attention-aware Polarity Sensitive Embedding (APSE) network to learn affective representations in an end-to-end manner. First, to automatically discover and model the informative regions of interest, we develop a hierarchical attention mechanism, in which both polarity- and emotion-specific attended representations are aggregated for discriminative feature embedding. Second, we present a weighted emotion-pair loss that takes the inter- and intra-polarity relationships of the emotional labels into consideration. Guided by the attention module, we weight the sample pairs adaptively, which further improves the performance of the feature embedding. Extensive experiments on four popular benchmark datasets show that the proposed method performs favorably against the state-of-the-art approaches.
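As a hedged illustration of a polarity-sensitive pair loss, the sketch below pulls same-emotion pairs together and pushes apart cross-polarity pairs with a larger margin than same-polarity, different-emotion pairs; the attention-guided adaptive pair weighting from the paper is omitted, and the margins and names are assumptions.

```python
import torch

def polarity_pair_loss(emb, emotion, polarity, m_intra=0.2, m_inter=0.5):
    """emb: (B, D) embeddings; emotion, polarity: (B,) integer labels."""
    d = torch.cdist(emb, emb)                                      # (B, B) pairwise embedding distances
    same_emotion = emotion.unsqueeze(0) == emotion.unsqueeze(1)
    same_polarity = polarity.unsqueeze(0) == polarity.unsqueeze(1)
    pull = (d ** 2) * same_emotion.float()                         # pull same-emotion pairs together
    push_intra = torch.relu(m_intra - d).pow(2) * (same_polarity & ~same_emotion).float()
    push_inter = torch.relu(m_inter - d).pow(2) * (~same_polarity).float()
    return (pull + push_intra + push_inter).mean()
```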
Citations: 24
Universal Adversarial Perturbation via Prior Driven Uncertainty Approximation
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00303
Hong Liu, Rongrong Ji, Jie Li, Baochang Zhang, Yue Gao, Yongjian Wu, Feiyue Huang
Deep learning models have shown their vulnerability to universal adversarial perturbations (UAPs), which are quasi-imperceptible. Compared to conventional supervised UAPs, which require knowledge of the training data, data-independent unsupervised UAPs are more widely applicable. Existing unsupervised methods fail to take advantage of the model uncertainty to produce robust perturbations. In this paper, we propose a new unsupervised universal adversarial perturbation method, termed Prior Driven Uncertainty Approximation (PD-UA), to generate a robust UAP by fully exploiting the model uncertainty at each network layer. Specifically, a Monte Carlo sampling method is deployed to activate more neurons to increase the model uncertainty for a better adversarial perturbation. Thereafter, a textural bias prior is proposed to reveal the statistical uncertainty, which helps to improve the attacking performance. The UAP is crafted by the stochastic gradient descent algorithm with a boosted momentum optimizer, and a Laplacian pyramid frequency model is finally used to maintain the statistical uncertainty. Extensive experiments demonstrate that our method achieves strong attacking performance on the ImageNet validation set and significantly improves the fooling rate compared with the state-of-the-art methods.
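One way to read the uncertainty-approximation idea is to ascend the Monte-Carlo (dropout) prediction variance with respect to a universal perturbation; the sketch below does exactly that, omitting the boosted momentum optimizer and the Laplacian-pyramid prior. The objective, step size, and bound are assumptions, not the paper's exact formulation.

```python
import torch

def uap_step(model, delta, images, lr=0.01, eps=10 / 255, mc_samples=5):
    """One ascent step on a universal perturbation delta (same shape as one image)."""
    model.train()                                          # keep dropout active for MC sampling
    delta = delta.clone().requires_grad_(True)
    preds = torch.stack([model(torch.clamp(images + delta, 0.0, 1.0)).softmax(dim=1)
                         for _ in range(mc_samples)])      # (S, B, C) stochastic predictions
    uncertainty = preds.var(dim=0).mean()                  # MC variance as the surrogate objective
    uncertainty.backward()
    with torch.no_grad():
        delta = (delta + lr * delta.grad.sign()).clamp_(-eps, eps)   # FGSM-style ascent, bounded
    return delta.detach()
```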
Citations: 65
Monocular Piecewise Depth Estimation in Dynamic Scenes by Exploiting Superpixel Relations
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00446
D. Yan, Henrique Morimitsu, Shan Gao, Xiangyang Ji
In this paper, we propose a novel and specially designed method for piecewise dense monocular depth estimation in dynamic scenes. We utilize spatial relations between neighboring superpixels to solve the inherent relative scale ambiguity (RSA) problem and smooth the depth map. However, directly estimating spatial relations is an ill-posed problem. Our core idea is to predict spatial relations based on the corresponding motion relations. Given two or more consecutive frames, we first compute semi-dense (CPM) or dense (optical flow) point matches between temporally neighboring images. Then we develop our method in four main stages: superpixel relation analysis, motion selection, reconstruction, and refinement. The final refinement process helps to improve the quality of the reconstruction at the pixel level. Our method does not require per-object segmentation, template priors, or training sets, which ensures flexibility in various applications. Extensive experiments on both synthetic and real datasets demonstrate that our method robustly handles different dynamic situations and presents results competitive with the state-of-the-art methods while running much faster.
Citations: 8
Order-Aware Generative Modeling Using the 3D-Craft Dataset
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00185
Zhuoyuan Chen, Demi Guo, Tong Xiao, Saining Xie, Xinlei Chen, Haonan Yu, Jonathan Gray, Kavya Srinet, Haoqi Fan, Jerry Ma, C. Qi, Shubham Tulsiani, Arthur Szlam, C. L. Zitnick
In this paper, we study the problem of sequentially building houses in the game of Minecraft, and demonstrate that learning the ordering can make for more effective autoregressive models. Given a partially built house made by a human player, our system tries to place additional blocks in a human-like manner to complete the house. We introduce a new dataset, HouseCraft, for this new task. HouseCraft contains the sequential order in which 2,500 Minecraft houses were built from scratch by humans. The human action sequences enable us to learn an order-aware generative model called Voxel-CNN. In contrast to many generative models where the sequential generation ordering either does not matter (e.g. holistic generation with GANs), or is manually/arbitrarily set by simple rules (e.g. raster-scan order), our focus is on an ordered generation that imitates humans. To evaluate if a generative model can accurately predict human-like actions, we propose several novel quantitative metrics. We demonstrate that our Voxel-CNN model is simple and effective at this creative task, and can serve as a strong baseline for future research in this direction. The HouseCraft dataset and code with baseline models will be made publicly available.
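A hedged sketch of an order-aware autoregressive rollout in the spirit of the abstract: given a partial voxel grid, a model scores (block-type, position) actions and the highest-scoring block is placed greedily, one step at a time. The model interface and shapes are assumptions, not the released Voxel-CNN API.

```python
import torch

def greedy_rollout(model, voxels, num_steps=50):
    """voxels: (D, H, W) long tensor of block ids, 0 = empty. `model` is assumed to
    return per-position, per-block-type scores of shape (1, num_block_types, D, H, W)."""
    for _ in range(num_steps):
        with torch.no_grad():
            scores = model(voxels.unsqueeze(0)).squeeze(0).clone()   # (C, D, H, W)
        scores[0] = float('-inf')                                    # never "place" the empty type
        scores[:, voxels != 0] = float('-inf')                       # only consider empty positions
        C, D, H, W = scores.shape
        flat = int(scores.argmax())                                  # best (type, position) action
        block_type, rem = divmod(flat, D * H * W)
        d, rem = divmod(rem, H * W)
        h, w = divmod(rem, W)
        voxels[d, h, w] = block_type
    return voxels
```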
Citations: 4