In this paper, we propose a generative model in the space of diffeomorphic deformation maps. More precisely, we utilize the Kantorovich-Wasserstein metric and its accompanying geometry to represent an image as a deformation from templates. Moreover, we incorporate a probabilistic viewpoint by assuming that each image is locally generated from a reference image. We capture the local structure by modeling the tangent planes at reference images. Once basis vectors for each tangent plane are learned via probabilistic PCA, we can sample a local coordinate that can be inverted back to image space exactly. With experiments on four different datasets, we show that the generative tangent plane model in the optimal transport (OT) manifold can be learned from small numbers of images and can be used to create infinitely many 'unseen' images. In addition, Bayesian classification based on the probabilistic modeling of the tangent planes shows improved accuracy over classification in the image space. Together, the results of our experiments support our claim that certain datasets can be better represented with the Kantorovich-Wasserstein metric. We envision that the proposed method could be a practical solution for learning and representing data that is generated with templates in situations where only limited numbers of data points are available.
{"title":"Representing and Learning High Dimensional Data with the Optimal Transport Map from a Probabilistic Viewpoint","authors":"Serim Park, Matthew Thorpe","doi":"10.1109/CVPR.2018.00820","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00820","url":null,"abstract":"In this paper, we propose a generative model in the space of diffeomorphic deformation maps. More precisely, we utilize the Kantarovich-Wasserstein metric and accompanying geometry to represent an image as a deformation from templates. Moreover, we incorporate a probabilistic viewpoint by assuming that each image is locally generated from a reference image. We capture the local structure by modelling the tangent planes at reference images. Once basis vectors for each tangent plane are learned via probabilistic PCA, we can sample a local coordinate, that can be inverted back to image space exactly. With experiments using 4 different datasets, we show that the generative tangent plane model in the optimal transport (OT) manifold can be learned with small numbers of images and can be used to create infinitely many 'unseen' images. In addition, the Bayesian classification accompanied with the probabilist modeling of the tangent planes shows improved accuracy over that done in the image space. Combining the results of our experiments supports our claim that certain datasets can be better represented with the Kantarovich-Wasserstein metric. We envision that the proposed method could be a practical solution to learning and representing data that is generated with templates in situatons where only limited numbers of data points are available.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"122 1","pages":"7864-7872"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73107147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yancheng Bai, Yongqiang Zhang, M. Ding, Bernard Ghanem
Face detection techniques have been developed for decades, and one of the remaining open challenges is detecting small faces in unconstrained conditions. The reason is that tiny faces often lack detailed information and are blurry. In this paper, we propose an algorithm to directly generate a clear high-resolution face from a blurry small one by adopting a generative adversarial network (GAN). A basic GAN formulation would achieve this by super-resolving and then refining sequentially (e.g., SR-GAN followed by cycle-GAN). In contrast, we design a novel network that addresses super-resolution and refinement jointly. We also introduce new training losses to guide the generator network to recover fine details and to promote the discriminator network to distinguish real vs. fake and face vs. non-face simultaneously. Extensive experiments on the challenging WIDER FACE dataset demonstrate the effectiveness of our proposed method in restoring a clear high-resolution face from a blurry small one, and show that its detection performance outperforms other state-of-the-art methods.
{"title":"Finding Tiny Faces in the Wild with Generative Adversarial Network","authors":"Yancheng Bai, Yongqiang Zhang, M. Ding, Bernard Ghanem","doi":"10.1109/CVPR.2018.00010","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00010","url":null,"abstract":"Face detection techniques have been developed for decades, and one of remaining open challenges is detecting small faces in unconstrained conditions. The reason is that tiny faces are often lacking detailed information and blurring. In this paper, we proposed an algorithm to directly generate a clear high-resolution face from a blurry small one by adopting a generative adversarial network (GAN). Toward this end, the basic GAN formulation achieves it by super-resolving and refining sequentially (e.g. SR-GAN and cycle-GAN). However, we design a novel network to address the problem of super-resolving and refining jointly. We also introduce new training losses to guide the generator network to recover fine details and to promote the discriminator network to distinguish real vs. fake and face vs. non-face simultaneously. Extensive experiments on the challenging dataset WIDER FACE demonstrate the effectiveness of our proposed method in restoring a clear high-resolution face from a blurry small one, and show that the detection performance outperforms other state-of-the-art methods.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"10 1","pages":"21-30"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75489037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present an algorithm to directly restore a clear image from a hazy image. This problem is highly ill-posed, and most existing algorithms use hand-crafted features, e.g., the dark channel, color disparity, or maximum contrast, to estimate transmission maps and then atmospheric lights. In contrast, we solve this problem with a conditional generative adversarial network (cGAN), where the clear image is estimated by an end-to-end trainable neural network. Different from the generator in a basic cGAN, we propose an encoder-decoder architecture so that the network can generate better results. To generate realistic clear images, we further modify the basic cGAN formulation by introducing VGG features and an L1-regularized gradient prior. We also synthesize a hazy dataset including indoor and outdoor scenes to train and evaluate the proposed algorithm. Extensive experimental results demonstrate that the proposed method performs favorably against state-of-the-art methods on both the synthetic dataset and real-world hazy images.
{"title":"Single Image Dehazing via Conditional Generative Adversarial Network","authors":"Runde Li, Jin-shan Pan, Zechao Li, Jinhui Tang","doi":"10.1109/CVPR.2018.00856","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00856","url":null,"abstract":"In this paper, we present an algorithm to directly restore a clear image from a hazy image. This problem is highly ill-posed and most existing algorithms often use hand-crafted features, e.g., dark channel, color disparity, maximum contrast, to estimate transmission maps and then atmospheric lights. In contrast, we solve this problem based on a conditional generative adversarial network (cGAN), where the clear image is estimated by an end-to-end trainable neural network. Different from the generative network in basic cGAN, we propose an encoder and decoder architecture so that it can generate better results. To generate realistic clear images, we further modify the basic cGAN formulation by introducing the VGG features and an L1-regularized gradient prior. We also synthesize a hazy dataset including indoor and outdoor scenes to train and evaluate the proposed algorithm. Extensive experimental results demonstrate that the proposed method performs favorably against the state-of-the-art methods on both synthetic dataset and real world hazy images.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"6 1","pages":"8202-8211"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75542708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a fast inverse-graphics framework for instance-level 3D scene understanding. We train a deep convolutional network that learns to map image regions to the full 3D shape and pose of all object instances in the image. Our method produces a compact 3D representation of the scene, which can be readily used for applications like autonomous driving. Many traditional 2D vision outputs, like instance segmentations and depth maps, can be obtained by simply rendering our output 3D scene model. We exploit class-specific shape priors by learning a low-dimensional shape space from collections of CAD models. We present novel representations of shape and pose that strive towards better 3D equivariance and generalization. In order to exploit rich supervisory signals in the form of 2D annotations like segmentation, we propose a differentiable Render-and-Compare loss that allows 3D shape and pose to be learned with 2D supervision. We evaluate our method on the challenging real-world Pascal3D+ and KITTI datasets, where we achieve state-of-the-art results.
{"title":"3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare","authors":"Abhijit Kundu, Yin Li, James M. Rehg","doi":"10.1109/CVPR.2018.00375","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00375","url":null,"abstract":"We present a fast inverse-graphics framework for instance-level 3D scene understanding. We train a deep convolutional network that learns to map image regions to the full 3D shape and pose of all object instances in the image. Our method produces a compact 3D representation of the scene, which can be readily used for applications like autonomous driving. Many traditional 2D vision outputs, like instance segmentations and depth-maps, can be obtained by simply rendering our output 3D scene model. We exploit class-specific shape priors by learning a low dimensional shape-space from collections of CAD models. We present novel representations of shape and pose, that strive towards better 3D equivariance and generalization. In order to exploit rich supervisory signals in the form of 2D annotations like segmentation, we propose a differentiable Render-and-Compare loss that allows 3D shape and pose to be learned with 2D supervision. We evaluate our method on the challenging real-world datasets of Pascal3D+ and KITTI, where we achieve state-of-the-art results.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"22 1","pages":"3559-3568"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75720749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Federico Camposeco, Andrea Cohen, M. Pollefeys, Torsten Sattler
In this paper, we aim to solve the pose estimation problem of calibrated pinhole and generalized cameras w.r.t. a Structure-from-Motion (SfM) model by leveraging both 2D-3D correspondences and 2D-2D correspondences. Traditional approaches focus either on the use of 2D-3D matches, known as structure-based pose estimation, or solely on 2D-2D matches (structure-less pose estimation). Absolute pose approaches are limited in their performance by the quality of the 3D point triangulations as well as the completeness of the 3D model. Relative pose approaches, on the other hand, while being more accurate, also tend to be far more computationally costly and often return dozens of possible solutions. This work aims to bridge the gap between these two paradigms. We propose a new RANSAC-based approach that automatically chooses the best type of solver to use at each iteration in a data-driven way. The solvers chosen by our RANSAC can range from pure structure-based or structure-less solvers to any possible combination of hybrid solvers (i.e. using both types of matches) in between. A number of these new hybrid minimal solvers are also presented in this paper. Both synthetic and real data experiments show our approach to be as accurate as structure-less approaches, while staying close to the efficiency of structure-based methods.
{"title":"Hybrid Camera Pose Estimation","authors":"Federico Camposeco, Andrea Cohen, M. Pollefeys, Torsten Sattler","doi":"10.1109/CVPR.2018.00022","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00022","url":null,"abstract":"In this paper, we aim to solve the pose estimation problem of calibrated pinhole and generalized cameras w.r.t. a Structure-from-Motion (SfM) model by leveraging both 2D-3D correspondences as well as 2D-2D correspondences. Traditional approaches either focus on the use of 2D-3D matches, known as structure-based pose estimation or solely on 2D-2D matches (structure-less pose estimation). Absolute pose approaches are limited in their performance by the quality of the 3D point triangulations as well as the completeness of the 3D model. Relative pose approaches, on the other hand, while being more accurate, also tend to be far more computationally costly and often return dozens of possible solutions. This work aims to bridge the gap between these two paradigms. We propose a new RANSAC-based approach that automatically chooses the best type of solver to use at each iteration in a data-driven way. The solvers chosen by our RANSAC can range from pure structure-based or structure-less solvers, to any possible combination of hybrid solvers (i.e. using both types of matches) in between. A number of these new hybrid minimal solvers are also presented in this paper. Both synthetic and real data experiments show our approach to be as accurate as structure-less approaches, while staying close to the efficiency of structure-based methods.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"16 2 1","pages":"136-144"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78023021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning to hash improves image retrieval performance through end-to-end representation learning and hash coding from training data with pairwise similarity information. Because similarity information is scarce and often expensive to collect in many application domains, existing deep learning to hash methods may overfit the training data, resulting in a substantial loss of retrieval quality. This paper presents HashGAN, a novel architecture for deep learning to hash, which learns compact binary hash codes from both real images and diverse images synthesized by generative models. The main idea is to augment the training data with nearly real images synthesized from a new Pair Conditional Wasserstein GAN (PC-WGAN) conditioned on the pairwise similarity information. Extensive experiments demonstrate that HashGAN can generate high-quality binary hash codes and yields state-of-the-art image retrieval performance on three benchmarks: NUS-WIDE, CIFAR-10, and MS-COCO.
{"title":"HashGAN: Deep Learning to Hash with Pair Conditional Wasserstein GAN","authors":"Yue Cao, Bin Liu, Mingsheng Long, Jianmin Wang","doi":"10.1109/CVPR.2018.00140","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00140","url":null,"abstract":"Deep learning to hash improves image retrieval performance by end-to-end representation learning and hash coding from training data with pairwise similarity information. Subject to the scarcity of similarity information that is often expensive to collect for many application domains, existing deep learning to hash methods may overfit the training data and result in substantial loss of retrieval quality. This paper presents HashGAN, a novel architecture for deep learning to hash, which learns compact binary hash codes from both real images and diverse images synthesized by generative models. The main idea is to augment the training data with nearly real images synthesized from a new Pair Conditional Wasserstein GAN (PC-WGAN) conditioned on the pairwise similarity information. Extensive experiments demonstrate that HashGAN can generate high-quality binary hash codes and yield state-of-the-art image retrieval performance on three benchmarks, NUS-WIDE, CIFAR-10, and MS-COCO.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"24 1","pages":"1287-1296"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80833245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a prior-less method for tracking and clustering an unknown number of human faces and maintaining their individual identities in unconstrained videos. The key challenge is to accurately track faces with partial occlusion and drastic appearance changes across multiple shots resulting from significant variations in makeup, facial expression, head pose and illumination. To address this challenge, we propose a new multi-face tracking and re-identification algorithm, which provides high accuracy in face association over the entire video with automatic cluster number generation, and is robust to outliers. We develop a co-occurrence model of multiple body parts to seamlessly create face tracklets, and recursively link tracklets to construct a graph for extracting clusters. A Gaussian process model is introduced to compensate for the insufficiency of deep features, and is further used to refine the linking results. The advantages of the proposed algorithm are demonstrated using a variety of challenging music videos and newly introduced body-worn camera videos. The proposed method obtains significant improvements over the state of the art [51], while relying less on video-specific prior information to achieve high performance.
{"title":"A Prior-Less Method for Multi-face Tracking in Unconstrained Videos","authors":"Chung-Ching Lin, Ying Hung","doi":"10.1109/CVPR.2018.00063","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00063","url":null,"abstract":"This paper presents a prior-less method for tracking and clustering an unknown number of human faces and maintaining their individual identities in unconstrained videos. The key challenge is to accurately track faces with partial occlusion and drastic appearance changes in multiple shots resulting from significant variations of makeup, facial expression, head pose and illumination. To address this challenge, we propose a new multi-face tracking and re-identification algorithm, which provides high accuracy in face association in the entire video with automatic cluster number generation, and is robust to outliers. We develop a co-occurrence model of multiple body parts to seamlessly create face tracklets, and recursively link tracklets to construct a graph for extracting clusters. A Gaussian Process model is introduced to compensate the deep feature insufficiency, and is further used to refine the linking results. The advantages of the proposed algorithm are demonstrated using a variety of challenging music videos and newly introduced body-worn camera videos. The proposed method obtains significant improvements over the state of the art [51], while relying less on handling video-specific prior information to achieve high performance.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"80 1","pages":"538-547"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80880656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although video summarization has achieved great success in recent years, few approaches have considered the influence of video structure on the summarization results. Video data follow a hierarchical structure: a video is composed of shots, and a shot is composed of several frames. Generally, shots provide the activity-level information people need to understand the video content. However, few existing summarization approaches pay attention to the shot segmentation procedure; they generate shots with trivial strategies, such as fixed-length segmentation, which may destroy the underlying hierarchical structure of the video and further reduce the quality of the generated summaries. To address this problem, we propose a structure-adaptive video summarization approach that integrates shot segmentation and video summarization into a Hierarchical Structure-Adaptive RNN, denoted HSA-RNN. We evaluate the proposed approach on four popular datasets, i.e., SumMe, TVsum, CoSum and VTW. The experimental results demonstrate the effectiveness of HSA-RNN on the video summarization task.
{"title":"HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization","authors":"Bin Zhao, Xuelong Li, Xiaoqiang Lu","doi":"10.1109/CVPR.2018.00773","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00773","url":null,"abstract":"Although video summarization has achieved great success in recent years, few approaches have realized the influence of video structure on the summarization results. As we know, the video data follow a hierarchical structure, i.e., a video is composed of shots, and a shot is composed of several frames. Generally, shots provide the activity-level information for people to understand the video content. While few existing summarization approaches pay attention to the shot segmentation procedure. They generate shots by some trivial strategies, such as fixed length segmentation, which may destroy the underlying hierarchical structure of video data and further reduce the quality of generated summaries. To address this problem, we propose a structure-adaptive video summarization approach that integrates shot segmentation and video summarization into a Hierarchical Structure-Adaptive RNN, denoted as HSA-RNN. We evaluate the proposed approach on four popular datasets, i.e., SumMe, TVsum, CoSum and VTW. The experimental results have demonstrated the effectiveness of HSA-RNN in the video summarization task.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"11 1","pages":"7405-7414"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84164123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite the recent emergence of video captioning methods, how to generate fine-grained video descriptions (i.e., long and detailed commentary about individual movements of multiple subjects and their frequent interactions) is far from solved, even though it has important applications such as automatic sports narrative. To this end, this work makes the following contributions. First, to facilitate this novel line of research on fine-grained video captioning, we collected a new dataset called the Fine-grained Sports Narrative dataset (FSN), which contains 2K sports videos with ground-truth narratives from YouTube.com. Second, we develop a novel performance evaluation metric named Fine-grained Captioning Evaluation (FCE) to cope with this task. Considered an extension of the widely used METEOR, it measures not only the linguistic performance but also whether the action details and their temporal order are correctly described. Third, we propose a new framework for the fine-grained sports narrative task. The network features three branches: 1) a spatio-temporal entity localization and role-discovering sub-network; 2) a fine-grained action modeling sub-network for local skeleton motion description; and 3) a group relationship modeling sub-network to model interactions between players. We further fuse the features and decode them into long narratives with a hierarchically recurrent structure. Extensive experiments on the FSN dataset demonstrate the validity of the proposed framework for fine-grained video captioning.
{"title":"Fine-Grained Video Captioning for Sports Narrative","authors":"Huanyu Yu, Shuo Cheng, Bingbing Ni, Minsi Wang, Jian Zhang, Xiaokang Yang","doi":"10.1109/CVPR.2018.00629","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00629","url":null,"abstract":"Despite recent emergence of video caption methods, how to generate fine-grained video descriptions (i.e., long and detailed commentary about individual movements of multiple subjects as well as their frequent interactions) is far from being solved, which however has great applications such as automatic sports narrative. To this end, this work makes the following contributions. First, to facilitate this novel research of fine-grained video caption, we collected a novel dataset called Fine-grained Sports Narrative dataset (FSN) that contains 2K sports videos with ground-truth narratives from YouTube.com. Second, we develop a novel performance evaluation metric named Fine-grained Captioning Evaluation (FCE) to cope with this novel task. Considered as an extension of the widely used METEOR, it measures not only the linguistic performance but also whether the action details and their temporal orders are correctly described. Third, we propose a new framework for fine-grained sports narrative task. This network features three branches: 1) a spatio-temporal entity localization and role discovering sub-network; 2) a fine-grained action modeling sub-network for local skeleton motion description; and 3) a group relationship modeling sub-network to model interactions between players. We further fuse the features and decode them into long narratives by a hierarchically recurrent structure. Extensive experiments on the FSN dataset demonstrates the validity of the proposed framework for fine-grained video caption.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"15 2 1","pages":"6006-6015"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78520722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Face synthesis has achieved advanced development through the use of generative adversarial networks (GANs). Existing methods typically formulate GAN training as a two-player game, in which a discriminator distinguishes face images of the real and synthesized domains, while a generator reduces this discriminability by synthesizing faces of photorealistic quality. Their competition converges when the discriminator is unable to differentiate the two domains. Unlike two-player GANs, this work generates identity-preserving faces by proposing FaceID-GAN, which treats a classifier of face identity as the third player, competing with the generator by distinguishing the identities of real and synthesized faces (see Fig. 1). A stationary point is reached when the generator produces faces that are of high quality and preserve identity. Instead of simply modeling the identity classifier as an additional discriminator, FaceID-GAN is formulated to satisfy information symmetry, which ensures that the real and synthesized images are projected into the same feature space. In other words, the identity classifier is used to extract identity features from both the input (real) and output (synthesized) face images of the generator, substantially alleviating the training difficulty of the GAN. Extensive experiments show that FaceID-GAN is able to generate faces of arbitrary viewpoint while preserving identity, outperforming recent advanced approaches.
{"title":"FaceID-GAN: Learning a Symmetry Three-Player GAN for Identity-Preserving Face Synthesis","authors":"Yujun Shen, Ping Luo, Junjie Yan, Xiaogang Wang, Xiaoou Tang","doi":"10.1109/CVPR.2018.00092","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00092","url":null,"abstract":"Face synthesis has achieved advanced development by using generative adversarial networks (GANs). Existing methods typically formulate GAN as a two-player game, where a discriminator distinguishes face images from the real and synthesized domains, while a generator reduces its discriminativeness by synthesizing a face of photorealistic quality. Their competition converges when the discriminator is unable to differentiate these two domains. Unlike two-player GANs, this work generates identity-preserving faces by proposing FaceID-GAN, which treats a classifier of face identity as the third player, competing with the generator by distinguishing the identities of the real and synthesized faces (see Fig.1). A stationary point is reached when the generator produces faces that have high quality as well as preserve identity. Instead of simply modeling the identity classifier as an additional discriminator, FaceID-GAN is formulated by satisfying information symmetry, which ensures that the real and synthesized images are projected into the same feature space. In other words, the identity classifier is used to extract identity features from both input (real) and output (synthesized) face images of the generator, substantially alleviating training difficulty of GAN. Extensive experiments show that FaceID-GAN is able to generate faces of arbitrary viewpoint while preserve identity, outperforming recent advanced approaches.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"1 1","pages":"821-830"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72882487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}