Attacking Optical Flow
Anurag Ranjan, J. Janai, Andreas Geiger, Michael J. Black
Deep neural nets achieve state-of-the-art performance on the problem of optical flow estimation. Since optical flow is used in several safety-critical applications such as self-driving cars, it is important to gain insight into the robustness of these techniques. Recently, it has been shown that adversarial attacks easily fool deep neural networks into misclassifying objects. The robustness of optical flow networks to adversarial attacks, however, has not been studied so far. In this paper, we extend adversarial patch attacks to optical flow networks and show that such attacks can compromise their performance. We show that corrupting a small patch of less than 1% of the image size can significantly affect optical flow estimates. Our attacks lead to noisy flow estimates that extend well beyond the region of the attack, in many cases even completely erasing the motion of objects in the scene. While networks using an encoder-decoder architecture are very sensitive to these attacks, we find that networks using a spatial pyramid architecture are less affected. We analyse the success and failure of attacking both architectures by visualizing their feature maps and comparing them to classical optical flow techniques, which are robust to these attacks. We also demonstrate that such attacks are practical by placing a printed pattern into real scenes.
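As a rough illustration of the patch-attack setup described above, the sketch below pastes a small patch into both frames of an image pair and measures how much the predicted flow changes. The `flow_net` callable, the patch contents, and the patch location are placeholders; the actual attack optimizes the patch rather than using a fixed one.

```python
import torch

def apply_patch(frame, patch, top, left):
    """Paste a small patch into a frame of shape (C, H, W) at (top, left)."""
    out = frame.clone()
    ph, pw = patch.shape[-2:]
    out[..., top:top + ph, left:left + pw] = patch
    return out

def patch_attack_epe(flow_net, frame1, frame2, patch, top, left):
    """Average end-point error between flow on clean and on patched frames.

    flow_net(frame1, frame2) is assumed to return a (2, H, W) flow field."""
    with torch.no_grad():
        flow_clean = flow_net(frame1, frame2)
        flow_attacked = flow_net(apply_patch(frame1, patch, top, left),
                                 apply_patch(frame2, patch, top, left))
    # How far the attack pushes the flow estimate away from the clean prediction.
    return (flow_attacked - flow_clean).norm(dim=0).mean().item()

# Usage with a random (non-optimized) patch covering well under 1% of a 384x512 frame:
# patch = torch.rand(3, 30, 30); epe = patch_attack_epe(net, f1, f2, patch, 100, 200)
```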
{"title":"Attacking Optical Flow","authors":"Anurag Ranjan, J. Janai, Andreas Geiger, Michael J. Black","doi":"10.1109/ICCV.2019.00249","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00249","url":null,"abstract":"Deep neural nets achieve state-of-the-art performance on the problem of optical flow estimation. Since optical flow is used in several safety-critical applications like self-driving cars, it is important to gain insights into the robustness of those techniques. Recently, it has been shown that adversarial attacks easily fool deep neural networks to misclassify objects. The robustness of optical flow networks to adversarial attacks, however, has not been studied so far. In this paper, we extend adversarial patch attacks to optical flow networks and show that such attacks can compromise their performance. We show that corrupting a small patch of less than 1% of the image size can significantly affect optical flow estimates. Our attacks lead to noisy flow estimates that extend significantly beyond the region of the attack, in many cases even completely erasing the motion of objects in the scene. While networks using an encoder-decoder architecture are very sensitive to these attacks, we found that networks using a spatial pyramid architecture are less affected. We analyse the success and failure of attacking both architectures by visualizing their feature maps and comparing them to classical optical flow techniques which are robust to these attacks. We also demonstrate that such attacks are practical by placing a printed pattern into real scenes.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"7 1","pages":"2404-2413"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82552863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to Jointly Generate and Separate Reflections
Daiqian Ma, Renjie Wan, Boxin Shi, A. Kot, Ling-yu Duan
Existing learning-based single-image reflection removal methods that rely on paired training data have fundamental limitations in generalizing to real-world reflections, owing to the limited variation in the training pairs. In this work, we propose to jointly generate and separate reflections within a weakly-supervised learning framework, aiming to model reflection image formation more comprehensively with abundant unpaired supervision. By imposing adversarial losses and a combinable mapping mechanism in a multi-task structure, the proposed framework integrates the two separate stages of reflection generation and separation into a unified model. A gradient constraint is also incorporated into the concurrent multi-task training process. In addition, we build an unpaired reflection dataset with 4,027 images, which facilitates weakly-supervised learning of the reflection removal model. Extensive experiments on a public benchmark dataset show that our framework performs favorably against state-of-the-art methods and consistently produces visually appealing results.
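The abstract does not spell out the image formation model, but reflection-removal work commonly assumes an additive blend of a transmission layer and a blurred, attenuated reflection layer. The sketch below synthesizes such a mixture from two unpaired images; the additive model and parameter values are illustrative assumptions, not the paper's learned generator.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize_reflection(transmission, reflection, alpha=0.3, blur_sigma=2.0):
    """Blend unpaired transmission/reflection images into a synthetic
    reflection-contaminated image I = T + alpha * blur(R).

    Both inputs are float arrays in [0, 1] with shape (H, W, 3)."""
    blurred = gaussian_filter(reflection, sigma=(blur_sigma, blur_sigma, 0))
    mixed = transmission + alpha * blurred
    return np.clip(mixed, 0.0, 1.0)
```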
{"title":"Learning to Jointly Generate and Separate Reflections","authors":"Daiqian Ma, Renjie Wan, Boxin Shi, A. Kot, Ling-yu Duan","doi":"10.1109/ICCV.2019.00253","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00253","url":null,"abstract":"Existing learning-based single image reflection removal methods using paired training data have fundamental limitations about the generalization capability on real-world reflections due to the limited variations in training pairs. In this work, we propose to jointly generate and separate reflections within a weakly-supervised learning framework, aiming to model the reflection image formation more comprehensively with abundant unpaired supervision. By imposing the adversarial losses and combinable mapping mechanism in a multi-task structure, the proposed framework elegantly integrates the two separate stages of reflection generation and separation into a unified model. The gradient constraint is incorporated into the concurrent training process of the multi-task learning as well. In particular, we built up an unpaired reflection dataset with 4,027 images, which is useful for facilitating the weakly-supervised learning of reflection removal model. Extensive experiments on a public benchmark dataset show that our framework performs favorably against state-of-the-art methods and consistently produces visually appealing results.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"453 1","pages":"2444-2452"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82932267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style
H. Ge, Zehang Yan, Kai Zhang, Mingde Zhao, Liang Sun
Image captioning is a research hotspot in which encoder-decoder models combining a convolutional neural network (CNN) with long short-term memory (LSTM) achieve promising results. Despite significant progress, these models generate sentences differently from how humans do: they often generate a complete sentence from the first word to the last, without considering the influence of subsequent words on the sentence as a whole. In this paper, we explore a human-like cognitive style, i.e., building overall cognition of the image to be described and the sentence to be constructed, to enhance image understanding. We first propose a Mutual-aid network structure with Bidirectional LSTMs (MaBi-LSTMs) for acquiring overall contextual information. During training, the forward and backward LSTMs encode the succeeding and preceding words into their respective hidden states by simultaneously constructing the whole sentence in a complementary manner. During captioning, each LSTM implicitly exploits the subsequent semantic information contained in its hidden states. In fact, MaBi-LSTMs can generate two sentences, one in the forward and one in the backward direction. To bridge the gap between these two sentences and generate a sentence of higher quality, we further develop a cross-modal attention mechanism that retouches the two sentences by fusing their salient parts with the salient areas of the image. Experimental results on the Microsoft COCO dataset show that the proposed model improves upon encoder-decoder baselines and achieves state-of-the-art results.
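A minimal sketch of the bidirectional decoding idea: during training, one LSTM reads the caption left to right and another reads it right to left, so each position has hidden states summarizing both the preceding and the succeeding words. The module below shows only this mutual-aid encoding step; the cross-modal attention that fuses the two generated sentences with image regions is not reproduced, and all layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class ForwardBackwardDecoder(nn.Module):
    """Two LSTMs over the same caption, one per direction (illustrative only)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, T) word ids
        emb = self.embed(tokens)
        h_fwd, _ = self.fwd_lstm(emb)          # states summarize preceding words
        h_bwd, _ = self.bwd_lstm(emb.flip(1))  # run over the reversed caption
        h_bwd = h_bwd.flip(1)                  # re-align to original word order
        return h_fwd, h_bwd                    # complementary context per position
```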
{"title":"Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style","authors":"H. Ge, Zehang Yan, Kai Zhang, Mingde Zhao, Liang Sun","doi":"10.1109/ICCV.2019.00184","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00184","url":null,"abstract":"Image captioning is a research hotspot where encoder-decoder models combining convolutional neural network (CNN) and long short-term memory (LSTM) achieve promising results. Despite significant progress, these models generate sentences differently from human cognitive styles. Existing models often generate a complete sentence from the first word to the end, without considering the influence of the following words on the whole sentence generation. In this paper, we explore the utilization of a human-like cognitive style, i.e., building overall cognition for the image to be described and the sentence to be constructed, for enhancing computer image understanding. This paper first proposes a Mutual-aid network structure with Bidirectional LSTMs (MaBi-LSTMs) for acquiring overall contextual information. In the training process, the forward and backward LSTMs encode the succeeding and preceding words into their respective hidden states by simultaneously constructing the whole sentence in a complementary manner. In the captioning process, the LSTM implicitly utilizes the subsequent semantic information contained in its hidden states. In fact, MaBi-LSTMs can generate two sentences in forward and backward directions. To bridge the gap between cross-domain models and generate a sentence with higher quality, we further develop a cross-modal attention mechanism to retouch the two sentences by fusing their salient parts as well as the salient areas of the image. Experimental results on the Microsoft COCO dataset show that the proposed model improves the performance of encoder-decoder models and achieves state-of-the-art results.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"1 1","pages":"1754-1763"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91257552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Emotion Recognition via Affective Structural Embedding
Chi Zhan, Dongyu She, Sicheng Zhao, Ming-Ming Cheng, Jufeng Yang
Image emotion recognition has attracted much attention in recent years due to its wide applications. It aims to classify the emotional response of humans, where the candidate emotion categories are generally defined by specific psychological theories, such as Ekman's six basic emotions. However, as psychological theories develop, emotion categories become increasingly diverse and fine-grained, and collecting samples for them becomes difficult. In this paper, we investigate the zero-shot learning (ZSL) problem in emotion recognition, which aims to recognize unseen emotion categories. Specifically, we propose a novel affective-structural embedding framework that utilizes a mid-level semantic representation, i.e., adjective-noun pair (ANP) features, to construct an affective embedding space. The learned intermediate space narrows the semantic gap between low-level visual features and high-level semantic features. In addition, we introduce an affective adversarial constraint to retain the discriminative capacity of the visual features and the affective structural information of the semantic features during training. Our method is evaluated on five widely used affective datasets, and the experimental results show that the proposed algorithm outperforms state-of-the-art approaches.
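A common way to realize zero-shot recognition with an embedding space like the one described above is to score an image against semantic prototypes of the unseen categories and pick the nearest one. The sketch below shows that generic step only; `embed_net`, the prototype construction, and the use of cosine similarity are assumptions, and the paper's affective adversarial constraint is omitted.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(visual_feat, embed_net, class_prototypes):
    """Classify into unseen emotion categories by nearest semantic prototype.

    visual_feat:      (B, D_v) image features
    embed_net:        maps visual features into the affective embedding space
    class_prototypes: (C, D_e) semantic vectors (e.g. aggregated ANP features)
                      for the unseen emotion categories"""
    z = F.normalize(embed_net(visual_feat), dim=-1)
    p = F.normalize(class_prototypes, dim=-1)
    scores = z @ p.t()                 # cosine similarity to each unseen class
    return scores.argmax(dim=-1)       # index of the predicted emotion category
```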
{"title":"Zero-Shot Emotion Recognition via Affective Structural Embedding","authors":"Chi Zhan, Dongyu She, Sicheng Zhao, Ming-Ming Cheng, Jufeng Yang","doi":"10.1109/ICCV.2019.00124","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00124","url":null,"abstract":"Image emotion recognition attracts much attention in recent years due to its wide applications. It aims to classify the emotional response of humans, where candidate emotion categories are generally defined by specific psychological theories, such as Ekman’s six basic emotions. However, with the development of psychological theories, emotion categories become increasingly diverse, fine-grained, and difficult to collect samples. In this paper, we investigate zero-shot learning (ZSL) problem in the emotion recognition task, which tries to recognize the new unseen emotions. Specifically, we propose a novel affective-structural embedding framework, utilizing mid-level semantic representation, i.e., adjective-noun pairs (ANP) features, to construct an affective embedding space. By doing this, the learned intermediate space can narrow the semantic gap between low-level visual and high-level semantic features. In addition, we introduce an affective adversarial constraint to retain the discriminative capacity of visual features and the affective structural information of semantic features during training process. Our method is evaluated on five widely used affective datasets and the perimental results show the proposed algorithm outperforms the state-of-the-art approaches.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"30 1","pages":"1151-1160"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90481512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks
Zi-yi Liu, Le Wang, Qilin Zhang, Zhanning Gao, Zhenxing Niu, N. Zheng, G. Hua
Weakly-supervised temporal action localization (WS-TAL) is a promising but challenging task in which only video-level action category labels are available during training. Because it does not require temporal action boundary annotations in the training data, WS-TAL can exploit automatically retrieved video tags as video-level labels. However, such coarse video-level supervision inevitably introduces confusion, especially in untrimmed videos containing multiple action instances. To address this challenge, we propose the Contrast-based Localization EvaluAtioN Network (CleanNet), whose new action proposal evaluator provides pseudo-supervision by leveraging the temporal contrast in snippet-level action classification predictions. Essentially, the evaluator enforces an additional temporal contrast constraint so that high-scoring action proposals are more likely to coincide with true action instances. Moreover, the action localization module is an integral part of CleanNet, which enables end-to-end training; this is in contrast to many existing WS-TAL methods, where action localization is merely a post-processing step. Experiments on the THUMOS14 and ActivityNet datasets validate the efficacy of CleanNet against existing state-of-the-art WS-TAL algorithms.
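To make the notion of "temporal contrast" concrete, the sketch below scores a candidate proposal by how much its snippet-level class activations stand out from the snippets just before and after it. This inside-minus-outside contrast is a generic illustration under assumed inputs, not the exact CleanNet evaluator.

```python
import numpy as np

def temporal_contrast(snippet_scores, start, end, context=5):
    """Contrast score for a proposal [start, end) over per-snippet activations.

    snippet_scores: (T,) snippet-level scores for one action class."""
    inside = snippet_scores[start:end].mean()
    left = snippet_scores[max(0, start - context):start]
    right = snippet_scores[end:end + context]
    outside = np.concatenate([left, right])
    # A high value means the proposal's interior clearly stands out from its context.
    return inside - (outside.mean() if outside.size else 0.0)
```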
{"title":"Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks","authors":"Zi-yi Liu, Le Wang, Qilin Zhang, Zhanning Gao, Zhenxing Niu, N. Zheng, G. Hua","doi":"10.1109/ICCV.2019.00400","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00400","url":null,"abstract":"Weakly-supervised temporal action localization (WS-TAL) is a promising but challenging task with only video-level action categorical labels available during training. Without requiring temporal action boundary annotations in training data, WS-TAL could possibly exploit automatically retrieved video tags as video-level labels. However, such coarse video-level supervision inevitably incurs confusions, especially in untrimmed videos containing multiple action instances. To address this challenge, we propose the Contrast-based Localization EvaluAtioN Network (CleanNet) with our new action proposal evaluator, which provides pseudo-supervision by leveraging the temporal contrast in snippet-level action classification predictions. Essentially, the new action proposal evaluator enforces an additional temporal contrast constraint so that high-evaluation-score action proposals are more likely to coincide with true action instances. Moreover, the new action localization module is an integral part of CleanNet which enables end-to-end training. This is in contrast to many existing WS-TAL methods where action localization is merely a post-processing step. Experiments on THUMOS14 and ActivityNet datasets validate the efficacy of CleanNet against existing state-ofthe- art WS-TAL algorithms.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"106 1","pages":"3898-3907"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88059842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint Optimization for Cooperative Image Captioning
Gilad Vered, Gal Oren, Y. Atzmon, Gal Chechik
When describing images with natural language, descriptions can be made more informative if they are tuned for downstream tasks. This can be achieved by training two networks: a "speaker" that generates sentences given an image and a "listener" that uses them to perform a task. Unfortunately, training multiple networks jointly to communicate faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. To address these challenges, we present an effective optimization technique based on partial sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for Partial-Sampling Straight-Through. We then show that the generated descriptions can be kept close to natural language by constraining them to be similar to human descriptions. Together, this approach produces descriptions that are both more discriminative and more natural than those of previous approaches. Evaluations on the COCO benchmark show that PSST improves recall@10 from 60% to 86% while maintaining comparable language naturalness, and human evaluations confirm that it increases naturalness while preserving the discriminative power of the generated captions.
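The straight-through half of PSST can be illustrated with the standard straight-through estimator: sample a discrete token in the forward pass while letting gradients flow through the softmax probabilities. The sketch below shows only that generic component; the partial-sampling schedule that mixes sampled and non-sampled positions is the paper's specific contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def straight_through_sample(logits):
    """Sample word ids from a multinomial over `logits` (B, V) and return
    one-hot vectors whose gradients flow through the soft probabilities."""
    probs = F.softmax(logits, dim=-1)
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)   # (B,) sampled ids
    hard = F.one_hot(idx, num_classes=logits.size(-1)).float()  # discrete one-hot
    # Forward pass uses the discrete one-hot; backward pass differentiates probs.
    return hard + probs - probs.detach()
```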
{"title":"Joint Optimization for Cooperative Image Captioning","authors":"Gilad Vered, Gal Oren, Y. Atzmon, Gal Chechik","doi":"10.1109/ICCV.2019.00899","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00899","url":null,"abstract":"When describing images with natural language, descriptions can be made more informative if tuned for downstream tasks. This can be achieved by training two networks: a \"speaker\" that generates sentences given an image and a \"listener\" that uses them to perform a task. Unfortunately, training multiple networks jointly to communicate, faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. To address these challenges, we present an effective optimization technique based on partial-sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for Partial-Sampling Straight-Through. We then show that the generated descriptions can be kept close to natural by constraining them to be similar to human descriptions. Together, this approach creates descriptions that are both more discriminative and more natural than previous approaches. Evaluations on the COCO benchmark show that PSST improve the recall@10 from 60% to 86% maintaining comparable language naturalness. Human evaluations show that it also increases naturalness while keeping the discriminative power of generated captions.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"19 1","pages":"8897-8906"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85797483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TRB: A Novel Triplet Representation for Understanding 2D Human Body
Haodong Duan, Kwan-Yee Lin, Sheng Jin, Wentao Liu, C. Qian, Wanli Ouyang
Human pose and shape are two important components of the 2D human body. However, how to efficiently represent both of them in images remains an open question. In this paper, we propose the Triplet Representation for Body (TRB), a compact 2D human body representation in which skeleton keypoints capture pose information and contour keypoints capture shape information. TRB not only preserves the flexibility of the skeleton keypoint representation but also contains rich pose and shape information. It therefore promises broader applications, such as human shape editing and conditional image generation. We further introduce the challenging problem of TRB estimation, which requires jointly learning human pose and shape. We construct several large-scale TRB estimation datasets based on the popular 2D pose datasets LSP, MPII and COCO. To solve TRB estimation effectively, we propose a two-branch network (TRB-net) with three novel techniques, namely the X-structure (Xs), Directional Convolution (DC) and Pairwise Mapping (PM), to enforce multi-level message passing for joint feature learning. We evaluate the proposed TRB-net and several leading approaches on our TRB datasets and demonstrate the superiority of our method through extensive evaluations.
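A minimal container for a TRB-style annotation might pair the skeleton keypoints that carry pose with the contour keypoints that carry shape, as sketched below. The field names, keypoint counts, and visibility flags are illustrative assumptions, not the dataset's actual annotation format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates

@dataclass
class TripletRepresentation:
    """Illustrative container for a TRB-style annotation: skeleton keypoints
    for pose plus contour keypoints for shape."""
    skeleton: List[Point]     # e.g. LSP/MPII-style body joints
    contour: List[Point]      # contour keypoints describing body shape
    visibility: List[bool]    # per-skeleton-keypoint visibility flags
```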
{"title":"TRB: A Novel Triplet Representation for Understanding 2D Human Body","authors":"Haodong Duan, Kwan-Yee Lin, Sheng Jin, Wentao Liu, C. Qian, Wanli Ouyang","doi":"10.1109/ICCV.2019.00957","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00957","url":null,"abstract":"Human pose and shape are two important components of 2D human body. However, how to efficiently represent both of them in images is still an open question. In this paper, we propose the Triplet Representation for Body (TRB) --- a compact 2D human body representation, with skeleton keypoints capturing human pose information and contour keypoints containing human shape information. TRB not only preserves the flexibility of skeleton keypoint representation, but also contains rich pose and human shape information. Therefore, it promises broader application areas, such as human shape editing and conditional image generation. We further introduce the challenging problem of TRB estimation, where joint learning of human pose and shape is required. We construct several large-scale TRB estimation datasets, based on the popular 2D pose datasets LSP, MPII and COCO. To effectively solve TRB estimation, we propose a two-branch network (TRB-net) with three novel techniques, namely X-structure (Xs), Directional Convolution (DC) and Pairwise mapping (PM), to enforce multi-level message passing for joint feature learning. We evaluate our proposed TRB-net and several leading approaches on our proposed TRB datasets, and demonstrate the superiority of our method through extensive evaluations.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"70 1","pages":"9478-9487"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86314319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wasserstein GAN With Quadratic Transport Cost
Huidong Liu, X. Gu, D. Samaras
Wasserstein GANs are increasingly used in computer vision applications because they are easier to train. Previous WGAN variants mainly use the $l_1$ transport cost to compute the Wasserstein distance between the real and synthetic data distributions; the $l_1$ transport cost restricts the discriminator to be 1-Lipschitz. However, WGANs with the $l_1$ transport cost were recently shown to not always converge. In this paper, we propose WGAN-QC, a WGAN with a quadratic transport cost. Based on the quadratic transport cost, we propose an Optimal Transport Regularizer (OTR) to stabilize the training of WGAN-QC. We prove that the discriminator objective at each generator update computes the exact quadratic Wasserstein distance between the real and synthetic data distributions. We also prove that WGAN-QC converges to a local equilibrium point with finitely many discriminator updates per generator update. We show experimentally on a Dirac distribution that WGAN-QC converges where many $l_1$-cost WGANs fail to [22]. Qualitative and quantitative results on the CelebA, CelebA-HQ, LSUN and ImageNet dog datasets show that WGAN-QC outperforms state-of-the-art GAN methods, and it runs much faster than other WGAN variants.
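To make the quadratic transport cost concrete, the sketch below implements $c(x, y) = \tfrac{1}{2}\lVert x - y \rVert^2$ on image batches together with a simple penalty on violations of the Kantorovich dual constraint $D(x) - D(y) \le c(x, y)$. This generic feasibility penalty only stands in for the paper's Optimal Transport Regularizer, whose exact form is not reproduced here; `critic` is a placeholder network.

```python
import torch

def quadratic_cost(x, y):
    """Quadratic transport cost c(x, y) = ||x - y||^2 / 2 per batch element."""
    return 0.5 * (x - y).flatten(1).pow(2).sum(dim=1)

def dual_violation_penalty(critic, real, fake):
    """Penalize violations of D(real) - D(fake) <= c(real, fake).

    A generic stand-in for an optimal-transport regularizer, not the paper's OTR."""
    gap = critic(real).squeeze() - critic(fake).squeeze() - quadratic_cost(real, fake)
    return torch.relu(gap).pow(2).mean()
```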
{"title":"Wasserstein GAN With Quadratic Transport Cost","authors":"Huidong Liu, X. Gu, D. Samaras","doi":"10.1109/ICCV.2019.00493","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00493","url":null,"abstract":"Wasserstein GANs are increasingly used in Computer Vision applications as they are easier to train. Previous WGAN variants mainly use the $l_1$ transport cost to compute the Wasserstein distance between the real and synthetic data distributions. The $l_1$ transport cost restricts the discriminator to be 1-Lipschitz. However, WGANs with $l_1$ transport cost were recently shown to not always converge. In this paper, we propose WGAN-QC, a WGAN with quadratic transport cost. Based on the quadratic transport cost, we propose an Optimal Transport Regularizer (OTR) to stabilize the training process of WGAN-QC. We prove that the objective of the discriminator during each generator update computes the exact quadratic Wasserstein distance between real and synthetic data distributions. We also prove that WGAN-QC converges to a local equilibrium point with finite discriminator updates per generator update. We show experimentally on a Dirac distribution that WGAN-QC converges, when many of the $l_1$ cost WGANs fail to [22]. Qualitative and quantitative results on the CelebA, CelebA-HQ, LSUN and the ImageNet dog datasets show that WGAN-QC is better than state-of-art GAN methods. WGAN-QC has much faster runtime than other WGAN variants.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"24 1","pages":"4831-4840"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87517106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image Aesthetic Assessment Based on Pairwise Comparison: A Unified Approach to Score Regression, Binary Classification, and Personalization
Jun-Tae Lee, Chang-Su Kim
We propose a unified approach to three tasks: aesthetic score regression, binary aesthetic classification, and personalized aesthetics. First, we develop a comparator that estimates the ratio of aesthetic scores between two images. Then, we construct a pairwise comparison matrix for multiple reference images and an input image, and predict the aesthetic score of the input via the eigenvalue decomposition of the matrix. By varying the reference images, the proposed algorithm can be used for binary aesthetic classification and personalized aesthetics, as well as for generic score regression. Experimental results demonstrate that the proposed unified algorithm achieves state-of-the-art performance on all three image aesthetics tasks.
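The comparison-matrix step admits a short worked sketch: given a comparator that estimates score ratios between image pairs, build the matrix $C$ with $C_{ij} \approx s_i / s_j$ and recover the scores (up to scale) from its principal eigenvector, as in the classical analytic hierarchy process. The `ratio_fn` comparator below is a placeholder for the paper's learned network.

```python
import numpy as np

def scores_from_ratios(ratio_fn, items):
    """Recover relative aesthetic scores from pairwise ratio estimates.

    ratio_fn(a, b) is assumed to return an estimate of score(a) / score(b)."""
    n = len(items)
    C = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                C[i, j] = ratio_fn(items[i], items[j])
    # The principal eigenvector of a consistent ratio matrix is proportional
    # to the underlying scores.
    eigvals, eigvecs = np.linalg.eig(C)
    principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    scores = np.abs(principal)
    return scores / scores.sum()
```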
{"title":"Image Aesthetic Assessment Based on Pairwise Comparison A Unified Approach to Score Regression, Binary Classification, and Personalization","authors":"Jun-Tae Lee, Chang-Su Kim","doi":"10.1109/ICCV.2019.00128","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00128","url":null,"abstract":"We propose a unified approach to three tasks of aesthetic score regression, binary aesthetic classification, and personalized aesthetics. First, we develop a comparator to estimate the ratio of aesthetic scores for two images. Then, we construct a pairwise comparison matrix for multiple reference images and an input image, and predict the aesthetic score of the input via the eigenvalue decomposition of the matrix. By varying the reference images, the proposed algorithm can be used for binary aesthetic classification and personalized aesthetics, as well as generic score regression. Experimental results demonstrate that the proposed unified algorithm provides the state-of-the-art performances in all three tasks of image aesthetics.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"80 1","pages":"1191-1200"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89162507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
Zhigang Li, Gu Wang, Xiangyang Ji
6-DoF object pose estimation from a single RGB image is a fundamental and long-standing problem in computer vision. Current leading approaches solve it by training deep networks either to regress both rotation and translation directly from the image or to construct 2D-3D correspondences and solve for the pose indirectly via PnP. We argue that rotation and translation should be treated differently because of their significant differences. In this work, we propose a novel 6-DoF pose estimation approach, the Coordinates-based Disentangled Pose Network (CDPN), which disentangles the pose and predicts rotation and translation separately to achieve highly accurate and robust pose estimation. Our method is flexible, efficient and highly accurate, and it can handle texture-less and occluded objects. Extensive experiments on the LINEMOD and Occlusion datasets demonstrate the superiority of our approach; concretely, it significantly exceeds state-of-the-art RGB-based methods on commonly used metrics.
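A minimal two-branch sketch of the disentanglement idea follows: a shared feature map feeds a dense object-coordinate branch (consistent with the "coordinates-based" rotation path, later solved with PnP) and a direct regression branch for translation. The backbone, channel sizes, and head designs are placeholders, not the exact CDPN architecture.

```python
import torch
import torch.nn as nn

class DisentangledPoseHeads(nn.Module):
    """Two-branch sketch: per-pixel object coordinates for the rotation path
    and direct regression for translation (illustrative only)."""

    def __init__(self, feat_channels=256):
        super().__init__()
        # Rotation branch: predict per-pixel 3D object coordinates.
        self.coord_head = nn.Sequential(
            nn.Conv2d(feat_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 3, 1))
        # Translation branch: pool the feature map and regress (tx, ty, tz).
        self.trans_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_channels, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 3))

    def forward(self, feat):            # feat: (B, C, H, W) backbone features
        return self.coord_head(feat), self.trans_head(feat)
```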
{"title":"CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation","authors":"Zhigang Li, Gu Wang, Xiangyang Ji","doi":"10.1109/ICCV.2019.00777","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00777","url":null,"abstract":"6-DoF object pose estimation from a single RGB image is a fundamental and long-standing problem in computer vision. Current leading approaches solve it by training deep networks to either regress both rotation and translation from image directly or to construct 2D-3D correspondences and further solve them via PnP indirectly. We argue that rotation and translation should be treated differently for their significant difference. In this work, we propose a novel 6-DoF pose estimation approach: Coordinates-based Disentangled Pose Network (CDPN), which disentangles the pose to predict rotation and translation separately to achieve highly accurate and robust pose estimation. Our method is flexible, efficient, highly accurate and can deal with texture-less and occluded objects. Extensive experiments on LINEMOD and Occlusion datasets are conducted and demonstrate the superiority of our approach. Concretely, our approach significantly exceeds the state-of-the- art RGB-based methods on commonly used metrics.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"19 1","pages":"7677-7686"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86200063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}