
Latest Publications from the 2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Transductive Episodic-Wise Adaptive Metric for Few-Shot Learning
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00370
Limeng Qiao, Yemin Shi, Jia Li, Yaowei Wang, Tiejun Huang, Yonghong Tian
Few-shot learning, which aims at extracting new concepts rapidly from extremely few examples of novel classes, has been featured into the meta-learning paradigm recently. Yet, the key challenge of how to learn a generalizable classifier with the capability of adapting to specific tasks with severely limited data still remains in this domain. To this end, we propose a Transductive Episodic-wise Adaptive Metric (TEAM) framework for few-shot learning, by integrating the meta-learning paradigm with both deep metric learning and transductive inference. With exploring the pairwise constraints and regularization prior within each task, we explicitly formulate the adaptation procedure into a standard semi-definite programming problem. By solving the problem with its closed-form solution on the fly with the setup of transduction, our approach efficiently tailors an episodic-wise metric for each task to adapt all features from a shared task-agnostic embedding space into a more discriminative task-specific metric space. Moreover, we further leverage an attention-based bi-directional similarity strategy for extracting the more robust relationship between queries and prototypes. Extensive experiments on three benchmark datasets show that our framework is superior to other existing approaches and achieves the state-of-the-art performance in the few-shot literature.
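To make the episodic workflow above concrete, the sketch below assumes pre-computed support/query embeddings: it builds class prototypes, derives a task-specific metric, and classifies queries under it. It is only an illustrative stand-in — a shrinkage-regularized inverse covariance replaces the paper's semi-definite-programming solution, and all function and variable names are assumptions.

```python
# Illustrative sketch of episodic, task-adaptive metric classification.
# NOTE: this is NOT the paper's semi-definite-programming formulation; a
# shrinkage-regularized inverse covariance plays the role of the task metric.
import numpy as np

def episodic_predict(support, support_labels, query, eps=0.1):
    """support: (N, d) embeddings; support_labels: (N,); query: (M, d)."""
    classes = np.unique(support_labels)
    prototypes = np.stack([support[support_labels == c].mean(axis=0) for c in classes])

    # Task-specific metric: regularized inverse of the within-task covariance.
    centered = support - prototypes[np.searchsorted(classes, support_labels)]
    cov = centered.T @ centered / len(support) + eps * np.eye(support.shape[1])
    metric = np.linalg.inv(cov)  # positive definite by construction

    # Mahalanobis-style distance from every query to every prototype.
    diff = query[:, None, :] - prototypes[None, :, :]        # (M, C, d)
    dists = np.einsum('mcd,de,mce->mc', diff, metric, diff)  # (M, C)
    return classes[dists.argmin(axis=1)]

# 5-way 5-shot toy episode with random 64-d embeddings.
rng = np.random.default_rng(0)
support = rng.normal(size=(25, 64))
labels = np.repeat(np.arange(5), 5)
print(episodic_predict(support, labels, rng.normal(size=(10, 64))))
```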
Citations: 158
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00576
Iro Armeni, Zhi-Yang He, JunYoung Gwak, A. Zamir, Martin Fischer, J. Malik, S. Savarese
A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, 3D shapes, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, shape and other attributes), rooms (e.g., function, illumination type, etc.) and cameras (e.g., location, etc.), as well as the relationships among these entities. However, this process is prohibitively labor heavy if done manually. To alleviate this we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate in different camera locations.
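As a concrete picture of the structure described above — objects, rooms, and cameras as attributed nodes linked by relationships — here is a minimal sketch built on networkx; the attribute names and relation labels are illustrative assumptions rather than the authors' exact schema.

```python
# Minimal 3D scene graph sketch with networkx; attributes and relations are
# illustrative assumptions, not the authors' exact schema.
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("room_1", kind="room", function="office", illumination="artificial")
G.add_node("chair_3", kind="object", cls="chair", material="wood", location=(1.2, 3.0, 0.0))
G.add_node("cam_12", kind="camera", location=(2.1, 0.4, 1.6))

G.add_edge("chair_3", "room_1", relation="part_of")   # object -> room containment
G.add_edge("cam_12", "chair_3", relation="observes")  # camera -> object visibility

# Example query: all objects contained in room_1.
objects_in_room = [u for u, _, d in G.in_edges("room_1", data=True)
                   if d["relation"] == "part_of" and G.nodes[u]["kind"] == "object"]
print(objects_in_room)  # ['chair_3']
```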
Citations: 198
Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00404
H. Wang, Cheng Deng, Junchi Yan, D. Tao
Actor and action video segmentation from natural language query aims to selectively segment the actor and its action in a video based on an input textual description. Previous works mostly focus on learning simple correlation between two heterogeneous features of vision and language via dynamic convolution or fully convolutional classification. However, they ignore the linguistic variation of natural language query and have difficulty in modeling global visual context, which leads to unsatisfactory segmentation performance. To address these issues, we propose an asymmetric cross-guided attention network for actor and action video segmentation from natural language query. Specifically, we frame an asymmetric cross-guided attention network, which consists of vision guided language attention to reduce the linguistic variation of input query and language guided vision attention to incorporate query-focused global visual context simultaneously. Moreover, we adopt multi-resolution fusion scheme and weighted loss for foreground and background pixels to obtain further performance improvement. Extensive experiments on Actor-Action Dataset Sentences and J-HMDB Sentences show that our proposed approach notably outperforms state-of-the-art methods.
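The asymmetric cross-guidance can be sketched as two attention passes, one in each direction between visual locations and query words. The PyTorch snippet below is a simplified stand-in for the paper's network; module names, head counts, and feature sizes are assumptions.

```python
# Simplified sketch of the two attention directions, assuming PyTorch.
import torch
import torch.nn as nn

class CrossGuidedAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vis_guides_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_guides_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, HW, C) flattened visual features; lang: (B, T, C) word features.
        # Vision-guided language attention: visual content re-weights the words,
        # reducing the effect of linguistic variation in the query.
        lang_ctx, _ = self.vis_guides_lang(query=vis, key=lang, value=lang)
        # Language-guided vision attention: the query re-weights visual locations,
        # injecting query-focused global visual context.
        vis_ctx, _ = self.lang_guides_vis(query=lang, key=vis, value=vis)
        return lang_ctx, vis_ctx

vis = torch.randn(2, 64, 256)   # e.g. an 8x8 feature map, flattened
lang = torch.randn(2, 10, 256)  # 10 word embeddings
lang_ctx, vis_ctx = CrossGuidedAttention()(vis, lang)
```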
Citations: 58
Objects365: A Large-Scale, High-Quality Dataset for Object Detection
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00852
Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, Jian Sun
In this paper, we introduce a new large-scale object detection dataset, Objects365, which has 365 object categories over 600K training images. More than 10 million high-quality bounding boxes are manually labeled through a three-step, carefully designed annotation pipeline. It is the largest object detection dataset (with full annotation) so far and establishes a more challenging benchmark for the community. Objects365 can serve as a better feature learning dataset for localization-sensitive tasks like object detection and semantic segmentation. Objects365 pre-trained models significantly outperform ImageNet pre-trained models, with a 5.6-point gain (42 vs 36.4) under the standard 90K-iteration setting on the COCO benchmark. Even compared with a much longer training schedule of 540K iterations, our Objects365 pre-trained model trained for 90K iterations still shows a 2.7-point gain (42 vs 39.3). Meanwhile, the fine-tuning time can be greatly reduced (by up to 10 times) when reaching the same accuracy. The better generalization ability of Objects365 has also been verified on CityPersons, VOC segmentation, and ADE tasks. The dataset as well as the pre-trained models have been released at www.objects365.org.
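The pre-training comparison above amounts to a standard transfer recipe: initialize a detector from Objects365 weights, then fine-tune on COCO for the usual 90K iterations. A hedged torchvision sketch follows; the checkpoint path is a placeholder (torchvision does not ship Objects365 weights), and the data loading and training loop are omitted.

```python
# Hedged sketch of the fine-tuning setting described above: start a COCO
# detector from (hypothetical) Objects365 weights and train a 90K-iteration
# schedule. The checkpoint path is a placeholder.
from pathlib import Path
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=91)

ckpt = Path("objects365_pretrained.pth")   # hypothetical checkpoint file
if ckpt.exists():
    state = torch.load(ckpt, map_location="cpu")
    model.load_state_dict(state, strict=False)  # strict=False: the box head differs (365 vs 80 classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
# 90K-iteration fine-tuning schedule as quoted in the abstract.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60_000, 80_000], gamma=0.1)
```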
Citations: 380
Deep Blind Hyperspectral Image Fusion
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00425
Wu Wang, Weihong Zeng, Yue Huang, Xinghao Ding, J. Paisley
Hyperspectral image fusion (HIF) reconstructs high spatial resolution hyperspectral images from low spatial resolution hyperspectral images and high spatial resolution multispectral images. Previous works usually assume that the linear mapping between the point spread functions of the hyperspectral camera and the spectral response functions of the conventional camera is known. This is unrealistic in many scenarios. We propose a method for the blind HIF problem based on deep learning, where the estimation of the observation model and the fusion process are optimized iteratively and alternately during the super-resolution reconstruction. In addition, the proposed framework enforces simultaneous spatial and spectral accuracy. Using three public datasets, the experimental results demonstrate that the proposed algorithm outperforms existing blind and non-blind methods.
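The alternating estimation of the observation model and the fused image can be illustrated with a toy optimization loop: update the latent high-resolution cube and the unknown spectral response to fit both observations. The sketch below is schematic, not the authors' network; in particular, a fixed average pooling stands in for the unknown spatial blur.

```python
# Schematic alternating optimization for blind fusion (not the authors'
# network): jointly refine the latent high-res hyperspectral cube X and the
# unknown spectral response R; average pooling stands in for the spatial blur.
import torch
import torch.nn.functional as F

def blind_fusion(lr_hsi, hr_msi, scale=4, iters=200, lr=1e-2):
    _, c_hs, h, w = lr_hsi.shape
    c_ms = hr_msi.shape[1]
    X = torch.nn.Parameter(F.interpolate(lr_hsi, size=(h * scale, w * scale),
                                         mode="bilinear", align_corners=False))
    R = torch.nn.Parameter(torch.rand(c_ms, c_hs))     # estimated spectral response
    opt = torch.optim.Adam([X, R], lr=lr)

    for _ in range(iters):
        opt.zero_grad()
        lr_pred = F.avg_pool2d(X, kernel_size=scale)                   # spatial degradation model
        msi_pred = torch.einsum("ms,bshw->bmhw", R.softmax(dim=1), X)  # spectral degradation model
        loss = F.l1_loss(lr_pred, lr_hsi) + F.l1_loss(msi_pred, hr_msi)
        loss.backward()
        opt.step()
    return X.detach()

fused = blind_fusion(torch.rand(1, 31, 16, 16), torch.rand(1, 3, 64, 64))
print(fused.shape)  # torch.Size([1, 31, 64, 64])
```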
Citations: 52
InGAN: Capturing and Retargeting the “DNA” of a Natural Image
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00459
Assaf Shocher, Shai Bagon, Phillip Isola, M. Irani
Generative Adversarial Networks (GANs) typically learn a distribution of images in a large image dataset, and are then able to generate new images from this distribution. However, each natural image has its own internal statistics, captured by its unique distribution of patches. In this paper we propose an “Internal GAN” (InGAN) – an image-specific GAN – which trains on a single input image and learns its internal distribution of patches. It is then able to synthesize a plethora of new natural images of significantly different sizes, shapes and aspect-ratios – all with the same internal patch-distribution (same “DNA”) as the input image. In particular, despite large changes in global size/shape of the image, all elements inside the image maintain their local size/shape. InGAN is fully unsupervised, requiring no additional data other than the input image itself. Once trained on the input image, it can remap the input to any size or shape in a single feedforward pass, while preserving the same internal patch distribution. InGAN provides a unified framework for a variety of tasks, bridging the gap between textures and natural images.
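The single-image, patch-level adversarial idea can be sketched with a fully convolutional patch discriminator that scores local realism at any output size, which is what lets a generator match the input's internal patch distribution. The snippet below is a simplified stand-in (the "generator" is just a resize), not the InGAN architecture.

```python
# Simplified stand-in for the patch-level adversarial idea: a fully
# convolutional discriminator scores local patches, so any retargeted output
# can be compared to the single input image's internal patch statistics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, 1, 4, stride=1, padding=1),  # one realism logit per local patch
        )

    def forward(self, x):
        return self.net(x)

D = PatchDiscriminator()
img = torch.rand(1, 3, 128, 128)                 # the single training image
retargeted = F.interpolate(img, size=(96, 192))  # placeholder for the generator output
real_logits, fake_logits = D(img), D(retargeted)  # works for any size/aspect-ratio
d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) + \
         F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
```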
Citations: 110
Discriminative Feature Transformation for Occluded Pedestrian Detection
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00965
Chunluan Zhou, Ming Yang, Junsong Yuan
Despite promising performance achieved by deep convolutional neural networks for non-occluded pedestrian detection, it remains a great challenge to detect partially occluded pedestrians. Compared with non-occluded pedestrian examples, it is generally more difficult to distinguish occluded pedestrian examples from background in feature space due to the missing occluded parts. In this paper, we propose a discriminative feature transformation which enforces feature separability of pedestrian and non-pedestrian examples to handle occlusions for pedestrian detection. Specifically, in feature space it makes pedestrian examples approach the centroid of easily classified non-occluded pedestrian examples and pushes non-pedestrian examples close to the centroid of easily classified non-pedestrian examples. Such a feature transformation partially compensates for the missing contribution of occluded parts in feature space, therefore improving the performance for occluded pedestrian detection. We implement our approach in the Fast R-CNN framework by adding one transformation network branch. We validate the proposed approach on two widely used pedestrian detection datasets: Caltech and CityPersons. Experimental results show that our approach achieves promising performance for both non-occluded and occluded pedestrian detection.
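The centroid-pulling behaviour described above can be written as a simple auxiliary loss: transformed features of pedestrian proposals are drawn toward a pedestrian centroid and background features toward a background centroid. The PyTorch sketch below assumes those centroids are maintained elsewhere (e.g. as running averages of easily classified examples); all names are illustrative.

```python
# Sketch of the centroid-pulling objective as an auxiliary training loss.
# ped_centroid / bg_centroid are assumed to be centroids of easily classified
# non-occluded pedestrian / non-pedestrian examples, maintained elsewhere.
import torch
import torch.nn.functional as F

def feature_transform_loss(features, labels, ped_centroid, bg_centroid):
    """features: (N, d) transformed ROI features; labels: (N,) 1=pedestrian, 0=background."""
    targets = torch.where(labels.unsqueeze(1) == 1, ped_centroid, bg_centroid)  # (N, d)
    return F.mse_loss(features, targets)

feats = torch.randn(8, 128)
labels = torch.randint(0, 2, (8,))
ped_c, bg_c = torch.randn(128), torch.randn(128)
loss = feature_transform_loss(feats, labels, ped_c, bg_c)  # added to the detection losses
```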
Citations: 41
Markerless Outdoor Human Motion Capture Using Multiple Autonomous Micro Aerial Vehicles
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00091
Nitin Saini, E. Price, Rahul Tallamraju, R. Enficiaud, R. Ludwig, Igor Martinovic, Aamir Ahmad, Michael J. Black
Capturing human motion in natural scenarios means moving motion capture out of the lab and into the wild. Typical approaches rely on fixed, calibrated cameras and reflective markers on the body, significantly limiting the motions that can be captured. To make motion capture truly unconstrained, we describe the first fully autonomous outdoor capture system based on flying vehicles. We use multiple micro aerial vehicles (MAVs), each equipped with a monocular RGB camera, an IMU, and a GPS receiver module. These detect the person, optimize their position, and localize themselves approximately. We then develop a markerless motion capture method that is suitable for this challenging scenario with a distant subject, viewed from above, with approximately calibrated and moving cameras. We combine multiple state-of-the-art 2D joint detectors with a 3D human body model and a powerful prior on human pose. We jointly optimize for 3D body pose and camera pose to robustly fit the 2D measurements. To our knowledge, this is the first successful demonstration of outdoor, full-body, markerless motion capture from autonomous flying vehicles.
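At the core of the fitting stage is a multi-camera reprojection objective: 3D joints and camera poses are refined so that projected joints match the per-MAV 2D detections, weighted by detector confidence. The sketch below uses a bare pinhole projection as a stand-in for the full body-model fitting; all shapes and names are assumptions.

```python
# Sketch of the joint 2D-fitting objective: refine 3D joints (and, in the full
# method, camera poses and body-model parameters) so their pinhole projections
# match the confidence-weighted 2D detections from every MAV.
import torch

def reprojection_loss(joints3d, cam_R, cam_t, K, detections2d, confidence):
    """joints3d: (J,3); cam_R: (C,3,3); cam_t: (C,3); K: (3,3);
    detections2d: (C,J,2); confidence: (C,J)."""
    cam_pts = torch.einsum("cij,kj->cki", cam_R, joints3d) + cam_t[:, None, :]  # (C, J, 3)
    proj = torch.einsum("ij,ckj->cki", K, cam_pts)
    proj2d = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)
    return (confidence * ((proj2d - detections2d) ** 2).sum(dim=-1)).mean()

J, C = 14, 3
joints3d = torch.zeros(J, 3, requires_grad=True)
cam_R = torch.eye(3).repeat(C, 1, 1)
cam_t = torch.tensor([[0.0, 0.0, 5.0]]).repeat(C, 1)
K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
loss = reprojection_loss(joints3d, cam_R, cam_t, K,
                         torch.rand(C, J, 2) * 400, torch.ones(C, J))
loss.backward()  # gradients flow back to the 3D joints
```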
Citations: 30
Deep Learning for Light Field Saliency Detection
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00893
Tiantian Wang, Yongri Piao, Huchuan Lu, Xiao Li, Lihe Zhang
Recent research in 4D saliency detection is limited by the deficiency of a large-scale 4D light field dataset. To address this, we introduce a new dataset to assist the subsequent research in 4D light field saliency detection. To the best of our knowledge, this is to date the largest light field dataset, providing 1465 all-focus images with human-labeled ground truth masks and the corresponding focal stack for every light field image. To verify the effectiveness of the light field data, we first introduce a fusion framework which includes two CNN streams, where the focal stacks and all-focus images serve as the input. The focal stack stream utilizes a recurrent attention mechanism to adaptively learn to integrate every slice in the focal stack, which benefits from the extracted features of the good slices. Then it is combined with the output map generated by the all-focus stream to make the saliency prediction. In addition, we introduce adversarial examples by intentionally adding noise to images to help train the deep network, which can improve the robustness of the proposed network. The noise is designed by users and is imperceptible, but can fool the CNNs into making the wrong prediction. Extensive experiments show the effectiveness and superiority of the proposed model on the popular evaluation metrics. The proposed method performs favorably compared with existing 2D, 3D and 4D saliency detection methods on the proposed dataset and the existing LFSD light field dataset. The code and results can be found at https://github.com/OIPLab-DUT/ICCV2019_Deeplightfield_Saliency. Moreover, to facilitate research in this field, all images we collected are shared in a ready-to-use manner.
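The slice-integration step of the focal-stack stream can be approximated by per-location attention weights over the slices, so well-focused slices dominate the fused feature. The PyTorch sketch below is a simplified, non-recurrent stand-in for the paper's recurrent attention mechanism.

```python
# Simplified, non-recurrent stand-in for the focal-stack integration: learn a
# per-slice, per-location attention map and fuse the slices by softmax weights,
# so well-focused slices contribute most.
import torch
import torch.nn as nn

class FocalStackFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # attention logits per slice/location

    def forward(self, slice_feats):
        # slice_feats: (B, S, C, H, W) features of S focal slices.
        B, S, C, H, W = slice_feats.shape
        logits = self.score(slice_feats.reshape(B * S, C, H, W)).reshape(B, S, 1, H, W)
        weights = logits.softmax(dim=1)
        return (weights * slice_feats).sum(dim=1)  # (B, C, H, W) fused feature

fused = FocalStackFusion()(torch.randn(2, 12, 64, 32, 32))
# `fused` would then be combined with the all-focus stream's map for the
# final saliency prediction.
```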
Citations: 79
On Boosting Single-Frame 3D Human Pose Estimation via Monocular Videos
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00228
Zhi Li, Xuan Wang, Fei Wang, Peilin Jiang
The premise of training an accurate 3D human pose estimation network is the possession of a huge amount of richly annotated training data. Nonetheless, manually obtaining rich and accurate annotations is, if not impossible, tedious and slow. In this paper, we propose to exploit monocular videos to complement the training dataset for single-image 3D human pose estimation tasks. At the beginning, a baseline model is trained with a small set of annotations. By fixing some reliable estimations produced by the resulting model, our method automatically collects annotations across the entire video by solving a 3D trajectory completion problem. Then, the baseline model is further trained with the collected annotations to learn the new poses. We evaluate our method on the broadly adopted Human3.6M and MPI-INF-3DHP datasets. As illustrated in the experiments, given only a small set of annotations, our method successfully enables the model to learn new poses from unlabelled monocular videos, improving the accuracy of the baseline model by about 10%. In contrast with previous approaches, our method does not rely on either multi-view imagery or any explicit 2D keypoint annotations.
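The auto-annotation step can be pictured as trajectory completion: keep the high-confidence per-frame predictions fixed and fill in the rest of the sequence. The NumPy sketch below substitutes per-joint linear interpolation for the paper's trajectory-completion formulation, so it is illustrative only.

```python
# Illustrative "fix reliable frames, complete the trajectory" step; linear
# interpolation stands in for the paper's trajectory-completion formulation.
import numpy as np

def complete_trajectory(poses, confidences, thresh=0.8):
    """poses: (T, J, 3) per-frame 3D predictions; confidences: (T,) model confidence."""
    T, J, _ = poses.shape
    reliable = np.where(confidences >= thresh)[0]
    if reliable.size == 0:          # nothing reliable: keep the raw predictions
        return poses.copy()
    completed = poses.copy()
    frames = np.arange(T)
    for j in range(J):
        for d in range(3):
            completed[:, j, d] = np.interp(frames, reliable, poses[reliable, j, d])
    return completed                # pseudo-annotations used to retrain the baseline

poses = np.random.randn(50, 17, 3)
conf = np.random.rand(50)
pseudo_labels = complete_trajectory(poses, conf)
```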
Citations: 34