
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Latest Publications

Explainability Methods for Graph Convolutional Neural Networks
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.01103
Phillip E. Pope, Soheil Kolouri, Mohammad Rostami, Charles E. Martin, Heiko Hoffmann
With the growing use of graph convolutional neural networks (GCNNs) comes the need for explainability. In this paper, we introduce explainability methods for GCNNs. We develop the graph analogues of three prominent explainability methods for convolutional neural networks: contrastive gradient-based (CG) saliency maps, Class Activation Mapping (CAM), and Excitation Back-Propagation (EB) and their variants, gradient-weighted CAM (Grad-CAM) and contrastive EB (c-EB). We show a proof-of-concept of these methods on classification problems in two application domains: visual scene graphs and molecular graphs. To compare the methods, we identify three desirable properties of explanations: (1) their importance to classification, as measured by the impact of occlusions, (2) their contrastivity with respect to different classes, and (3) their sparseness on a graph. We call the corresponding quantitative metrics fidelity, contrastivity, and sparsity and evaluate them for each method. Lastly, we analyze the salient subgraphs obtained from explanations and report frequently occurring patterns.
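The abstract does not give implementation details, but the Grad-CAM analogue it mentions follows a recipe that is easy to illustrate: average the gradient of a class score over the nodes of the last graph-convolution feature map, and use those averages to weight the node features. The toy two-layer graph convolution, its dimensions, and the mean pooling below are assumptions made only for illustration, not the authors' code.

```python
# Minimal sketch of Grad-CAM-style node saliency on a toy graph convolution.
import torch

torch.manual_seed(0)
N, F_in, F_hid, C = 6, 8, 16, 3                      # nodes, feature dims, classes
A = (torch.rand(N, N) > 0.5).float()
A = ((A + A.T) > 0).float() + torch.eye(N)           # symmetric adjacency with self-loops
A = A.clamp(max=1)
d_inv_sqrt = torch.diag(A.sum(1).pow(-0.5))
A_hat = d_inv_sqrt @ A @ d_inv_sqrt                  # symmetrically normalized adjacency

X = torch.rand(N, F_in)                              # node input features
W1 = (0.1 * torch.randn(F_in, F_hid)).requires_grad_()
W2 = (0.1 * torch.randn(F_hid, C)).requires_grad_()

H = torch.relu(A_hat @ X @ W1)                       # last graph-conv feature map, (N, F_hid)
H.retain_grad()
logits = (A_hat @ H @ W2).mean(dim=0)                # mean pooling over nodes -> class scores

logits[1].backward()                                 # explain class 1
alpha = H.grad.mean(dim=0)                           # gradient-averaged channel weights
node_saliency = torch.relu(H.detach() * alpha).sum(dim=1)
print(node_saliency)                                 # one heat value per node
```

Nodes with high heat play the role of the highlighted image regions that Grad-CAM produces on CNN feature maps.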
{"title":"Explainability Methods for Graph Convolutional Neural Networks","authors":"Phillip E. Pope, Soheil Kolouri, Mohammad Rostami, Charles E. Martin, Heiko Hoffmann","doi":"10.1109/CVPR.2019.01103","DOIUrl":"https://doi.org/10.1109/CVPR.2019.01103","url":null,"abstract":"With the growing use of graph convolutional neural networks (GCNNs) comes the need for explainability. In this paper, we introduce explainability methods for GCNNs. We develop the graph analogues of three prominent explainability methods for convolutional neural networks: contrastive gradient-based (CG) saliency maps, Class Activation Mapping (CAM), and Excitation Back-Propagation (EB) and their variants, gradient-weighted CAM (Grad-CAM) and contrastive EB (c-EB). We show a proof-of-concept of these methods on classification problems in two application domains: visual scene graphs and molecular graphs. To compare the methods, we identify three desirable properties of explanations: (1) their importance to classification, as measured by the impact of occlusions, (2) their contrastivity with respect to different classes, and (3) their sparseness on a graph. We call the corresponding quantitative metrics fidelity, contrastivity, and sparsity and evaluate them for each method. Lastly, we analyze the salient subgraphs obtained from explanations and report frequently occurring patterns.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"10764-10773"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83481975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 326
Residual Regression With Semantic Prior for Crowd Counting
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.00416
Jia Wan, Wenhan Luo, Baoyuan Wu, Antoni B. Chan, Wei Liu
Crowd counting is a challenging task due to factors such as large variations in crowdedness and severe occlusions. Although recent deep learning based counting algorithms have achieved great progress, the correlation knowledge among samples and the semantic prior have not yet been fully exploited. In this paper, a residual regression framework is proposed for crowd counting utilizing the correlation information among samples. By incorporating such information into our network, we discover that more intrinsic characteristics can be learned by the network, which thus generalizes better to unseen scenarios. Besides, we show how to effectively leverage the semantic prior to improve the performance of crowd counting. We also observe that the adversarial loss can be used to improve the quality of predicted density maps, thus leading to an improvement in crowd counting. Experiments on public datasets demonstrate the effectiveness and generalization ability of the proposed method.
{"title":"Residual Regression With Semantic Prior for Crowd Counting","authors":"Jia Wan, Wenhan Luo, Baoyuan Wu, Antoni B. Chan, Wei Liu","doi":"10.1109/CVPR.2019.00416","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00416","url":null,"abstract":"Crowd counting is a challenging task due to factors such as large variations in crowdedness and severe occlusions. Although recent deep learning based counting algorithms have achieved a great progress, the correlation knowledge among samples and the semantic prior have not yet been fully exploited. In this paper, a residual regression framework is proposed for crowd counting utilizing the correlation information among samples. By incorporating such information into our network, we discover that more intrinsic characteristics can be learned by the network which thus generalizes better to unseen scenarios. Besides, we show how to effectively leverage the semantic prior to improve the performance of crowd counting. We also observe that the adversarial loss can be used to improve the quality of predicted density maps, thus leading to an improvement in crowd counting. Experiments on public datasets demonstrate the effectiveness and generalization ability of the proposed method.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"27 1","pages":"4031-4040"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87297270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 100
Monocular Depth Estimation Using Relative Depth Maps
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.00996
Jae-Han Lee, Chang-Su Kim
We propose a novel algorithm for monocular depth estimation using relative depth maps. First, using a convolutional neural network, we estimate relative depths between pairs of regions, as well as ordinary depths, at various scales. Second, we restore relative depth maps from selectively estimated data based on the rank-1 property of pairwise comparison matrices. Third, we decompose ordinary and relative depth maps into components and recombine them optimally to reconstruct a final depth map. Experimental results show that the proposed algorithm provides state-of-the-art depth estimation performance.
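The rank-1 property referred to above has a simple interpretation: an ideal pairwise comparison matrix satisfies R_ij = d_i / d_j, so its logarithm decomposes into a difference of per-region terms and the depths can be recovered up to a global scale. The NumPy sketch below demonstrates only this property with a log-space averaging solver; the paper's actual decomposition and optimal recombination pipeline is assumed to be more involved.

```python
# Minimal sketch: recover depths (up to scale) from a noisy rank-1 ratio matrix.
import numpy as np

rng = np.random.default_rng(0)
d_true = rng.uniform(1.0, 5.0, size=8)                  # depths of 8 regions
R = d_true[:, None] / d_true[None, :]                   # ideal pairwise comparison matrix
R *= np.exp(0.01 * rng.standard_normal(R.shape))        # mild multiplicative noise

# log R_ij = log d_i - log d_j, so row means recover log d_i up to a constant.
log_d = np.log(R).mean(axis=1)
d_rec = np.exp(log_d)
d_rec *= d_true.mean() / d_rec.mean()                   # fix the unknown global scale

print(np.abs(d_rec - d_true).max())                     # small reconstruction error
```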
{"title":"Monocular Depth Estimation Using Relative Depth Maps","authors":"Jae-Han Lee, Chang-Su Kim","doi":"10.1109/CVPR.2019.00996","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00996","url":null,"abstract":"We propose a novel algorithm for monocular depth estimation using relative depth maps. First, using a convolutional neural network, we estimate relative depths between pairs of regions, as well as ordinary depths, at various scales. Second, we restore relative depth maps from selectively estimated data based on the rank-1 property of pairwise comparison matrices. Third, we decompose ordinary and relative depth maps into components and recombine them optimally to reconstruct a final depth map. Experimental results show that the proposed algorithm provides the state-of-art depth estimation performance.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"5 1","pages":"9721-9730"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87736584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 105
Deeply-Supervised Knowledge Synergy
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.00716
Dawei Sun, Anbang Yao, Aojun Zhou, Hao Zhao
Convolutional Neural Networks (CNNs) have become deeper and more complicated compared with the pioneering AlexNet. However, the prevailing training scheme still adds supervision only to the last layer of the network and propagates error information up layer by layer. In this paper, we propose Deeply-supervised Knowledge Synergy (DKS), a new method aiming to train CNNs with improved generalization ability for image classification tasks without introducing extra computational cost during inference. Inspired by the deeply-supervised learning scheme, we first append auxiliary supervision branches on top of certain intermediate network layers. While properly using auxiliary supervision can improve model accuracy to some degree, we go one step further and explore the possibility of utilizing the probabilistic knowledge dynamically learnt by the classifiers connected to the backbone network as a new regularization to improve the training. A novel synergy loss, which considers pairwise knowledge matching among all supervision branches, is presented. Intriguingly, it enables dense pairwise knowledge matching operations in both top-down and bottom-up directions at each training iteration, resembling a dynamic synergy process for the same task. We evaluate DKS on image classification datasets using state-of-the-art CNN architectures, and show that the models trained with it are consistently better than the corresponding counterparts. For instance, on the ImageNet classification benchmark, our ResNet-152 model outperforms the baseline model by a 1.47% margin in Top-1 accuracy. Code is available at https://github.com/sundw2014/DKS.
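As a rough illustration of the dense pairwise knowledge matching described above, the sketch below computes a temperature-softened KL term between every ordered pair of auxiliary classifier branches. The temperature, the stop-gradient on the "teacher" side of each pair, and the averaging are assumptions made for the sake of a runnable example, not the paper's exact synergy loss.

```python
# Minimal sketch of a pairwise knowledge-matching (synergy-style) loss.
import torch
import torch.nn.functional as F

def synergy_loss(branch_logits, T=2.0):
    """KL matching between every ordered pair of branch outputs."""
    n = len(branch_logits)
    loss = 0.0
    for i in range(n):                     # "student" branch
        for j in range(n):                 # "teacher" branch
            if i == j:
                continue
            p_teacher = F.softmax(branch_logits[j].detach() / T, dim=1)
            log_p_student = F.log_softmax(branch_logits[i] / T, dim=1)
            loss = loss + F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return loss / (n * (n - 1))

# Three branches (e.g., two auxiliary heads plus the final classifier), batch of 4, 10 classes.
logits = [torch.randn(4, 10, requires_grad=True) for _ in range(3)]
print(synergy_loss(logits))
```

In training, a term of this kind would be added to the usual cross-entropy losses of the individual branches.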
{"title":"Deeply-Supervised Knowledge Synergy","authors":"Dawei Sun, Anbang Yao, Aojun Zhou, Hao Zhao","doi":"10.1109/CVPR.2019.00716","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00716","url":null,"abstract":"Convolutional Neural Networks (CNNs) have become deeper and more complicated compared with the pioneering AlexNet. However, current prevailing training scheme follows the previous way of adding supervision to the last layer of the network only and propagating error information up layer-by-layer. In this paper, we propose Deeply-supervised Knowledge Synergy (DKS), a new method aiming to train CNNs with improved generalization ability for image classification tasks without introducing extra computational cost during inference. Inspired by the deeply-supervised learning scheme, we first append auxiliary supervision branches on top of certain intermediate network layers. While properly using auxiliary supervision can improve model accuracy to some degree, we go one step further to explore the possibility of utilizing the probabilistic knowledge dynamically learnt by the classifiers connected to the backbone network as a new regularization to improve the training. A novel synergy loss, which considers pairwise knowledge matching among all supervision branches, is presented. Intriguingly, it enables dense pairwise knowledge matching operations in both top-down and bottom-up directions at each training iteration, resembling a dynamic synergy process for the same task. We evaluate DKS on image classification datasets using state-of-the-art CNN architectures, and show that the models trained with it are consistently better than the corresponding counterparts. For instance, on the ImageNet classification benchmark, our ResNet-152 model outperforms the baseline model with a 1.47% margin in Top-1 accuracy. Code is available at https://github.com/sundw2014/DKS.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"61 1","pages":"6990-6999"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90591628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 46
Predicting Visible Image Differences Under Varying Display Brightness and Viewing Distance
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.00558
Nanyang Ye, Krzysztof Wolski, Rafał K. Mantiuk
Numerous applications require a robust metric that can predict whether image differences are visible or not. However, the accuracy of existing white-box visibility metrics, such as HDR-VDP, is often not good enough. CNN-based black-box visibility metrics have proven to be more accurate, but they cannot account for differences in viewing conditions, such as display brightness and viewing distance. In this paper, we propose a CNN-based visibility metric, which maintains the accuracy of deep network solutions and accounts for viewing conditions. To achieve this, we extend the existing dataset of locally visible differences (LocVis) with a new set of measurements collected under the aforementioned viewing conditions. Then, we develop a hybrid model that combines white-box processing stages, which model the effects of luminance masking and contrast sensitivity, with a black-box deep neural network. We demonstrate that the novel hybrid model can handle the change of viewing conditions correctly and outperforms state-of-the-art metrics.
{"title":"Predicting Visible Image Differences Under Varying Display Brightness and Viewing Distance","authors":"Nanyang Ye, Krzysztof Wolski, Rafał K. Mantiuk","doi":"10.1109/CVPR.2019.00558","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00558","url":null,"abstract":"Numerous applications require a robust metric that can predict whether image differences are visible or not. However, the accuracy of existing white-box visibility metrics, such as HDR-VDP, is often not good enough. CNN-based black-box visibility metrics have proven to be more accurate, but they cannot account for differences in viewing conditions, such as display brightness and viewing distance. In this paper, we propose a CNN-based visibility metric, which maintains the accuracy of deep network solutions and accounts for viewing conditions. To achieve this, we extend the existing dataset of locally visible differences (LocVis) with a new set of measurements, collected considering aforementioned viewing conditions. Then, we develop a hybrid model that combines white-box processing stages for modeling the effects of luminance masking and contrast sensitivity, with a black-box deep neural network. We demonstrate that the novel hybrid model can handle the change of viewing conditions correctly and outperforms state-of-the-art metrics.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"53 1","pages":"5429-5437"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85621061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Learning Words by Drawing Images
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.00213
Dídac Surís, Adrià Recasens, David Bau, David F. Harwath, James R. Glass, A. Torralba
We propose a framework for learning through drawing. Our goal is to learn the correspondence between spoken words and abstract visual attributes, from a dataset of spoken descriptions of images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, we propose a new method to use such GAN-generated images to train a model using a triplet loss. To apply the method, we develop Audio CLEVRGAN, a new dataset of audio descriptions of GAN-generated CLEVR images, and we describe a training procedure that creates a curriculum of GAN-generated images that focuses training on image pairs that differ in a specific, informative way. Training is done without additional supervision beyond the spoken captions and the GAN. We find that training that takes advantage of GAN-generated edited examples results in improvements in the model's ability to learn attributes compared to previous results. Our proposed learning framework also results in models that can associate spoken words with some abstract visual concepts such as color and size.
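The triplet loss mentioned above has a standard form that is easy to write down: an anchor embedding (here, a spoken caption) is pulled toward the embedding of its matching image and pushed away from a mismatched one by at least a margin. The cosine distance, the margin value, and the 128-dimensional embeddings below are assumptions; the audio and image encoders are stubbed with random tensors.

```python
# Minimal sketch of a margin-based triplet loss over paired embeddings.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)   # distance to the matching image
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)   # distance to the mismatched image
    return F.relu(d_pos - d_neg + margin).mean()

audio_emb = torch.randn(16, 128)   # spoken-caption embeddings (stub)
img_pos   = torch.randn(16, 128)   # embeddings of matching GAN-generated images (stub)
img_neg   = torch.randn(16, 128)   # embeddings of edited, mismatched images (stub)
print(triplet_loss(audio_emb, img_pos, img_neg))
```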
{"title":"Learning Words by Drawing Images","authors":"Dídac Surís, Adrià Recasens, David Bau, David F. Harwath, James R. Glass, A. Torralba","doi":"10.1109/CVPR.2019.00213","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00213","url":null,"abstract":"We propose a framework for learning through drawing. Our goal is to learn the correspondence between spoken words and abstract visual attributes, from a dataset of spoken descriptions of images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, we propose a new method to use such GAN-generated images to train a model using a triplet loss. To apply the method, we develop Audio CLEVRGAN, a new dataset of audio descriptions of GAN-generated CLEVR images, and we describe a training procedure that creates a curriculum of GAN-generated images that focuses training on image pairs that differ in a specific, informative way. Training is done without additional supervision beyond the spoken captions and the GAN. We find that training that takes advantage of GAN-generated edited examples results in improvements in the model's ability to learn attributes compared to previous results. Our proposed learning framework also results in models that can associate spoken words with some abstract visual concepts such as color and size.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"213 1 1","pages":"2029-2038"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85642682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
BAD SLAM: Bundle Adjusted Direct RGB-D SLAM
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.00022
Thomas Schöps, Torsten Sattler, M. Pollefeys
A key component of Simultaneous Localization and Mapping (SLAM) systems is the joint optimization of the estimated 3D map and camera trajectory. Bundle adjustment (BA) is the gold standard for this. Due to the large number of variables in dense RGB-D SLAM, previous work has focused on approximating BA. In contrast, in this paper we present a novel, fast direct BA formulation which we implement in a real-time dense RGB-D SLAM algorithm. In addition, we show that direct RGB-D SLAM systems are highly sensitive to rolling shutter, RGB and depth sensor synchronization, and calibration errors. In order to facilitate state-of-the-art research on direct RGB-D SLAM, we propose a novel, well-calibrated benchmark for this task that uses synchronized global shutter RGB and depth cameras. It includes a training set, a test set without public ground truth, and an online evaluation service. We observe that the ranking of methods changes on this dataset compared to existing ones, and our proposed algorithm outperforms all other evaluated SLAM methods. Our benchmark and our open source SLAM algorithm are available at: www.eth3d.net
{"title":"BAD SLAM: Bundle Adjusted Direct RGB-D SLAM","authors":"Thomas Schöps, Torsten Sattler, M. Pollefeys","doi":"10.1109/CVPR.2019.00022","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00022","url":null,"abstract":"A key component of Simultaneous Localization and Mapping (SLAM) systems is the joint optimization of the estimated 3D map and camera trajectory. Bundle adjustment (BA) is the gold standard for this. Due to the large number of variables in dense RGB-D SLAM, previous work has focused on approximating BA. In contrast, in this paper we present a novel, fast direct BA formulation which we implement in a real-time dense RGB-D SLAM algorithm. In addition, we show that direct RGB-D SLAM systems are highly sensitive to rolling shutter, RGB and depth sensor synchronization, and calibration errors. In order to facilitate state-of-the-art research on direct RGB-D SLAM, we propose a novel, well-calibrated benchmark for this task that uses synchronized global shutter RGB and depth cameras. It includes a training set, a test set without public ground truth, and an online evaluation service. We observe that the ranking of methods changes on this dataset compared to existing ones, and our proposed algorithm outperforms all other evaluated SLAM methods. Our benchmark and our open source SLAM algorithm are available at: www.eth3d.net","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"23 1","pages":"134-144"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85972063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 174
An Alternative Deep Feature Approach to Line Level Keyword Spotting
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.01294
George Retsinas, G. Louloudis, N. Stamatopoulos, Giorgos Sfikas, B. Gatos
Keyword spotting (KWS) is defined as the problem of detecting all instances of a given word, provided by the user either as a query word image (Query-by-Example, QbE) or a query word string (Query-by-String, QbS), in a body of digitized documents. Keyword detection is typically preceded by a preprocessing step where the text is segmented into text lines (line-level KWS). Methods following this paradigm are dominated by handwritten text recognition (HTR)-based approaches that are computationally expensive at test time; furthermore, they typically cannot handle image queries (QbE). In this work, we propose a time- and storage-efficient, deep feature-based approach that enables both the image and textual search options. Three distinct components, all modeled as neural networks, are combined: normalization, feature extraction, and representation of image and textual input in a common space. These components, even if designed on word-level image representations, collaborate in order to achieve an efficient line-level keyword spotting system. The experimental results indicate that the proposed system is on par with state-of-the-art KWS methods.
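Once image and textual queries are mapped into the common space, line-level spotting reduces to ranking text-line representations by similarity to the query embedding, as sketched below. The embedding networks and the 64-dimensional space are assumptions and are stubbed with random vectors; only the retrieval step is shown.

```python
# Minimal sketch of QbE/QbS retrieval by cosine similarity in a shared embedding space.
import numpy as np

rng = np.random.default_rng(0)
line_embs = rng.standard_normal((100, 64))    # embeddings of 100 text lines (stub)
query_emb = rng.standard_normal(64)           # embedding of a query word image or string (stub)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

scores = l2_normalize(line_embs) @ l2_normalize(query_emb)   # cosine similarities
ranking = np.argsort(-scores)                                # most similar lines first
print(ranking[:5])
```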
{"title":"An Alternative Deep Feature Approach to Line Level Keyword Spotting","authors":"George Retsinas, G. Louloudis, N. Stamatopoulos, Giorgos Sfikas, B. Gatos","doi":"10.1109/CVPR.2019.01294","DOIUrl":"https://doi.org/10.1109/CVPR.2019.01294","url":null,"abstract":"Keyword spotting (KWS) is defined as the problem of detecting all instances of a given word, provided by the user either as a query word image (Query-by-Example, QbE) or a query word string (Query-by-String, QbS) in a body of digitized documents. Keyword detection is typically preceded by a preprocessing step where the text is segmented into text lines (line-level KWS). Methods following this paradigm are monopolized by test-time computationally expensive handwritten text recognition (HTR)-based approaches; furthermore, they typically cannot handle image queries (QbE). In this work, we propose a time and storage-efficient, deep feature-based approach that enables both the image and textual search options. Three distinct components, all modeled as neural networks, are combined: normalization, feature extraction and representation of image and textual input into a common space. These components, even if designed on word level image representations, collaborate in order to achieve an efficient line level keyword spotting system. The experimental results indicate that the proposed system is on par with state-of-the-art KWS methods.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"12650-12658"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90595454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.00368
Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, C. Yan, Tao Mei
Discovering social relations, e.g., kinship, friendship, etc., from visual contents can make machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important medium: video. On one hand, the actions and storylines in videos provide more important cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatial-temporal locations, and may not even appear in the same image from beginning to end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects. For the temporal domain, we propose a Pyramid Graph Convolutional Network to perform temporal reasoning with multi-scale receptive fields, which can obtain both long-term and short-term storylines in videos. By this means, MSTR can comprehensively explore the multi-scale actions and storylines in spatial-temporal dimensions for social relation reasoning in videos. Extensive experiments on a new large-scale Video Social Relation dataset demonstrate the effectiveness of the proposed framework.
{"title":"Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning","authors":"Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, C. Yan, Tao Mei","doi":"10.1109/CVPR.2019.00368","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00368","url":null,"abstract":"Discovering social relations, e.g., kinship, friendship, etc., from visual contents can make machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important media--video. On one hand, the actions and storylines in videos provide more important cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatial-temporal locations, even not in one same image from beginning to the end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects. For the temporal domain, we propose a Pyramid Graph Convolutional Network to perform temporal reasoning with multi-scale receptive fields, which can obtain both long-term and short-term storylines in videos. By this means, MSTR can comprehensively explore the multi-scale actions and storylines in spatial-temporal dimensions for social relation reasoning in videos. Extensive experiments on a new large-scale Video Social Relation dataset demonstrate the effectiveness of the proposed framework.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"104 1","pages":"3561-3569"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87486283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 60
Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth
Pub Date : 2019-06-01 DOI: 10.1109/CVPR.2019.00904
D. Neven, Bert De Brabandere, M. Proesmans, L. Gool
Current state-of-the-art instance segmentation methods are not suited for real-time applications like autonomous driving, which require fast execution times at high accuracy. Although the currently dominant proposal-based methods have high accuracy, they are slow and generate masks at a fixed and low resolution. Proposal-free methods, by contrast, can generate masks at high resolution and are often faster, but fail to reach the same accuracy as the proposal-based methods. In this work we propose a new clustering loss function for proposal-free instance segmentation. The loss function pulls the spatial embeddings of pixels belonging to the same instance together and jointly learns an instance-specific clustering bandwidth, maximizing the intersection-over-union of the resulting instance mask. When combined with a fast architecture, the network can perform instance segmentation in real-time while maintaining a high accuracy. We evaluate our method on the challenging Cityscapes benchmark and achieve top results (5% improvement over Mask R-CNN) at more than 10 fps on 2MP images.
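A simplified picture of the mechanism described above: each pixel predicts a spatial embedding (its coordinates plus an offset), a learned bandwidth sigma turns distances to the instance centre into a soft Gaussian mask, and the mask is scored against the ground truth so that gradients reach both the offsets and sigma. The toy shapes, the centroid choice, and the plain soft-IoU surrogate below are assumptions for illustration and may differ from the paper's exact loss.

```python
# Minimal sketch: spatial embeddings + learned bandwidth -> soft instance mask -> soft IoU.
import torch

H, W = 32, 32
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
coords = torch.stack([xs, ys], dim=0) / max(H, W)            # normalized pixel grid, (2, H, W)

offsets = (0.05 * torch.randn(2, H, W)).requires_grad_()     # predicted per-pixel offsets
log_sigma = torch.zeros(1, requires_grad=True)               # learned clustering bandwidth (log-space)

gt_mask = torch.zeros(H, W)
gt_mask[8:24, 8:24] = 1.0                                    # toy ground-truth instance

emb = coords + offsets                                       # spatial embedding of every pixel
center = (emb * gt_mask).sum(dim=(1, 2)) / gt_mask.sum()     # instance centre in embedding space
dist2 = ((emb - center[:, None, None]) ** 2).sum(dim=0)
sigma = torch.exp(log_sigma)
soft_mask = torch.exp(-dist2 / (2 * sigma ** 2))             # Gaussian membership per pixel

inter = (soft_mask * gt_mask).sum()
union = soft_mask.sum() + gt_mask.sum() - inter
loss = 1.0 - inter / union                                   # push the soft IoU toward 1
loss.backward()                                              # gradients flow to offsets and sigma
print(float(loss))
```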
{"title":"Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth","authors":"D. Neven, Bert De Brabandere, M. Proesmans, L. Gool","doi":"10.1109/CVPR.2019.00904","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00904","url":null,"abstract":"Current state-of-the-art instance segmentation methods are not suited for real-time applications like autonomous driving, which require fast execution times at high accuracy. Although the currently dominant proposal-based methods have high accuracy, they are slow and generate masks at a fixed and low resolution. Proposal-free methods, by contrast, can generate masks at high resolution and are often faster, but fail to reach the same accuracy as the proposal-based methods. In this work we propose a new clustering loss function for proposal-free instance segmentation. The loss function pulls the spatial embeddings of pixels belonging to the same instance together and jointly learns an instance-specific clustering bandwidth, maximizing the intersection-over-union of the resulting instance mask. When combined with a fast architecture, the network can perform instance segmentation in real-time while maintaining a high accuracy. We evaluate our method on the challenging Cityscapes benchmark and achieve top results (5% improvement over Mask R-CNN) at more than 10 fps on 2MP images.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"8829-8837"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87766096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 215