
Latest publications from the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00147
Xiang Wang, Shaodi You, Xi Li, Huimin Ma
Weakly-supervised semantic segmentation under image-tag supervision is a challenging task, as it directly associates high-level semantics with low-level appearance. To bridge this gap, we propose an iterative bottom-up and top-down framework that alternately expands object regions and optimizes the segmentation network. We start from the initial localization produced by classification networks. Although classification networks respond only to small, coarse discriminative object regions, we argue that these regions contain significant common features of objects. So in the bottom-up step, we mine common object features from the initial localization and expand object regions with the mined features. To supplement non-discriminative regions, saliency maps are then considered under a Bayesian framework to refine the object regions. In the top-down step, the refined object regions are used as supervision to train the segmentation network and to predict object masks. These object masks provide more accurate localization and cover more of each object. We then take these object masks as the initial localization and mine common object features from them again. These steps are conducted iteratively to progressively produce finer object masks and optimize the segmentation network. Experimental results on the Pascal VOC 2012 dataset demonstrate that the proposed method outperforms previous state-of-the-art methods by a large margin.
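For intuition, the bottom-up mining-and-expansion step can be illustrated with a deliberately simplified toy: here the "common feature" is just the mean color of the current region and expansion is a color-distance threshold, standing in for the learned feature mining and region expansion described above.

```python
import numpy as np

def expand_region(image, seed_mask, threshold=0.15, num_rounds=3):
    """Toy bottom-up expansion: mine a 'common feature' (mean color) from the
    current region and grow the region to pixels that match it."""
    mask = seed_mask.copy()
    for _ in range(num_rounds):
        # Mine the common feature of the current object region.
        common_feature = image[mask].mean(axis=0)              # (3,)
        # Expand: keep pixels whose color is close to the mined feature.
        distance = np.linalg.norm(image - common_feature, axis=-1)
        mask = distance < threshold
    return mask

# Synthetic image: a bright square (object) on a dark background.
image = np.zeros((64, 64, 3)) + 0.1
image[20:50, 20:50] = 0.8
seed = np.zeros((64, 64), dtype=bool)
seed[30:35, 30:35] = True                  # small "discriminative" seed region
print(expand_region(image, seed).sum())    # expands the 25-pixel seed to the full 900-pixel square
```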
{"title":"Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features","authors":"Xiang Wang, Shaodi You, Xi Li, Huimin Ma","doi":"10.1109/CVPR.2018.00147","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00147","url":null,"abstract":"Weakly-supervised semantic segmentation under image tags supervision is a challenging task as it directly associates high-level semantic to low-level appearance. To bridge this gap, in this paper, we propose an iterative bottom-up and top-down framework which alternatively expands object regions and optimizes segmentation network. We start from initial localization produced by classification networks. While classification networks are only responsive to small and coarse discriminative object regions, we argue that, these regions contain significant common features about objects. So in the bottom-up step, we mine common object features from the initial localization and expand object regions with the mined features. To supplement non-discriminative regions, saliency maps are then considered under Bayesian framework to refine the object regions. Then in the top-down step, the refined object regions are used as supervision to train the segmentation network and to predict object masks. These object masks provide more accurate localization and contain more regions of object. Further, we take these object masks as initial localization and mine common object features from them. These processes are conducted iteratively to progressively produce fine object masks and optimize segmentation networks. Experimental results on Pascal VOC 2012 dataset demonstrate that the proposed method outperforms previous state-of-the-art methods by a large margin.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"455 1","pages":"1354-1362"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79750729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 271
A Bi-Directional Message Passing Model for Salient Object Detection
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00187
Lu Zhang, Ju Dai, Huchuan Lu, You He, G. Wang
Recent progress on salient object detection has benefited from Fully Convolutional Neural Networks (FCNs). The saliency cues contained in multi-level convolutional features are complementary for detecting salient objects, and how to integrate multi-level features remains an open problem in saliency detection. In this paper, we propose a novel bi-directional message passing model to integrate multi-level features for salient object detection. First, we apply a Multi-scale Context-aware Feature Extraction Module (MCFEM) to multi-level feature maps to capture rich context information. A bi-directional structure is then designed to pass messages between multi-level features, with a gate function controlling the message passing rate. We use the features after message passing, which simultaneously encode semantic information and spatial details, to predict saliency maps. Finally, the predicted results are efficiently combined to generate the final saliency map. Quantitative and qualitative experiments on five benchmark datasets demonstrate that our proposed model performs favorably against state-of-the-art methods under different evaluation metrics.
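A minimal PyTorch sketch of gated, bi-directional message passing between two feature levels is given below; the layer sizes and the exact gating form are illustrative assumptions rather than the released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMessagePassing(nn.Module):
    """Toy bi-directional message passing between a shallow (high-res) and a
    deep (low-res) feature map, with learned gates controlling the rate."""
    def __init__(self, channels=64):
        super().__init__()
        self.msg_down = nn.Conv2d(channels, channels, 3, padding=1)   # shallow -> deep
        self.msg_up = nn.Conv2d(channels, channels, 3, padding=1)     # deep -> shallow
        self.gate_down = nn.Conv2d(channels, 1, 1)
        self.gate_up = nn.Conv2d(channels, 1, 1)

    def forward(self, shallow, deep):
        # Resize so messages can be exchanged between resolutions.
        deep_up = F.interpolate(deep, size=shallow.shape[-2:],
                                mode='bilinear', align_corners=False)
        shallow_down = F.interpolate(shallow, size=deep.shape[-2:],
                                     mode='bilinear', align_corners=False)
        # Gates in [0, 1] modulate how much of each message is passed.
        shallow_new = shallow + torch.sigmoid(self.gate_up(deep_up)) * self.msg_up(deep_up)
        deep_new = deep + torch.sigmoid(self.gate_down(shallow_down)) * self.msg_down(shallow_down)
        return shallow_new, deep_new

shallow = torch.randn(1, 64, 80, 80)   # higher-resolution feature map
deep = torch.randn(1, 64, 40, 40)      # lower-resolution feature map
out_shallow, out_deep = GatedMessagePassing()(shallow, deep)
print(out_shallow.shape, out_deep.shape)   # [1, 64, 80, 80] and [1, 64, 40, 40]
```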
{"title":"A Bi-Directional Message Passing Model for Salient Object Detection","authors":"Lu Zhang, Ju Dai, Huchuan Lu, You He, G. Wang","doi":"10.1109/CVPR.2018.00187","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00187","url":null,"abstract":"Recent progress on salient object detection is beneficial from Fully Convolutional Neural Network (FCN). The saliency cues contained in multi-level convolutional features are complementary for detecting salient objects. How to integrate multi-level features becomes an open problem in saliency detection. In this paper, we propose a novel bi-directional message passing model to integrate multi-level features for salient object detection. At first, we adopt a Multi-scale Context-aware Feature Extraction Module (MCFEM) for multi-level feature maps to capture rich context information. Then a bi-directional structure is designed to pass messages between multi-level features, and a gate function is exploited to control the message passing rate. We use the features after message passing, which simultaneously encode semantic information and spatial details, to predict saliency maps. Finally, the predicted results are efficiently combined to generate the final saliency map. Quantitative and qualitative experiments on five benchmark datasets demonstrate that our proposed model performs favorably against the state-of-the-art methods under different evaluation metrics.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"11 1","pages":"1741-1750"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79855826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 354
Exploiting Transitivity for Learning Person Re-identification Models on a Budget
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00738
Sourya Roy, S. Paul, N. Young, A. Roy-Chowdhury
Minimizing labeling effort for person re-identification in camera networks is an important problem, as most existing popular methods are supervised and require a large amount of manual annotation, which is tedious to acquire. In this work, we focus on this labeling-effort minimization problem and approach it as a subset selection task whose objective is to select an optimal subset of image pairs for labeling without compromising performance. Towards this goal, our proposed scheme first represents any camera network (with k cameras) as an edge-weighted complete k-partite graph, where each vertex denotes a person and similarity scores between persons are used as edge weights. In the second stage, our algorithm selects an optimal subset of pairs by solving a triangle-free subgraph maximization problem on the k-partite graph. This subgraph weight maximization problem is NP-hard (at least for k = 4), which means that for large datasets the optimization problem becomes intractable. To make our framework scalable, we propose two polynomial-time approximately-optimal algorithms. The first is a 1/2-approximation algorithm that runs in time linear in the number of edges. The second is a greedy algorithm whose time complexity is sub-quadratic in the number of edges. Experiments on three state-of-the-art datasets show that the proposed approach requires on average only 8-15% manually labeled pairs to match the performance obtained when all pairs are manually annotated.
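The triangle-free selection idea can be illustrated with a generic greedy routine over a similarity matrix; this toy sketch is not the paper's 1/2-approximation or its k-partite construction, only an illustration of why avoiding triangles saves labels (the third edge of a labeled triangle could be inferred by transitivity).

```python
import numpy as np

def greedy_triangle_free_pairs(similarity, budget):
    """Greedily pick high-similarity pairs for labeling while keeping the
    selected edge set triangle-free, since the third edge of any triangle
    could be inferred from the other two by transitivity.
    `similarity` is a symmetric (n, n) score matrix."""
    n = similarity.shape[0]
    # Candidate edges, sorted by descending similarity (toy selection criterion).
    edges = [(similarity[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    edges.sort(reverse=True)

    adjacency = [set() for _ in range(n)]
    selected = []
    for score, i, j in edges:
        if len(selected) == budget:
            break
        # Adding (i, j) closes a triangle if i and j already share a neighbor.
        if adjacency[i] & adjacency[j]:
            continue
        adjacency[i].add(j)
        adjacency[j].add(i)
        selected.append((i, j))
    return selected

rng = np.random.default_rng(0)
sim = rng.random((6, 6))
sim = (sim + sim.T) / 2                    # make the score matrix symmetric
print(greedy_triangle_free_pairs(sim, budget=5))
```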
{"title":"Exploiting Transitivity for Learning Person Re-identification Models on a Budget","authors":"Sourya Roy, S. Paul, N. Young, A. Roy-Chowdhury","doi":"10.1109/CVPR.2018.00738","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00738","url":null,"abstract":"Minimization of labeling effort for person re-identification in camera networks is an important problem as most of the existing popular methods are supervised and they require large amount of manual annotations, acquiring which is a tedious job. In this work, we focus on this labeling effort minimization problem and approach it as a subset selection task where the objective is to select an optimal subset of image-pairs for labeling without compromising performance. Towards this goal, our proposed scheme first represents any camera network (with k number of cameras) as an edge weighted complete k-partite graph where each vertex denotes a person and similarity scores between persons are used as edge-weights. Then in the second stage, our algorithm selects an optimal subset of pairs by solving a triangle free subgraph maximization problem on the k-partite graph. This sub-graph weight maximization problem is NP-hard (at least for k = 4) which means for large datasets the optimization problem becomes intractable. In order to make our framework scalable, we propose two polynomial time approximately-optimal algorithms. The first algorithm is a 1/2-approximation algorithm which runs in linear time in the number of edges. The second algorithm is a greedy algorithm with sub-quadratic (in number of edges) time-complexity. Experiments on three state-of-the-art datasets depict that the proposed approach requires on an average only 8-15% manually labeled pairs in order to achieve the performance when all the pairs are manually annotated.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"4 1","pages":"7064-7072"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84515106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00154
Hsien-Tzu Cheng, Chun-Hung Chao, Jin-Dong Dong, Hao-Kai Wen, Tyng-Luh Liu, Min Sun
Automatic saliency prediction in 360° videos is critical for viewpoint guidance applications (e.g., Facebook 360 Guide). We propose a spatial-temporal network that is (1) trained in a weakly-supervised manner and (2) tailor-made for the 360° viewing sphere. Note that most existing methods are less scalable since they rely on annotated saliency maps for training. Most importantly, they convert the 360° sphere to 2D images (e.g., a single equirectangular image or multiple separate Normal Field-of-View (NFoV) images), which introduces distortion and image boundaries. In contrast, we propose a simple and effective Cube Padding (CP) technique. First, we render the 360° view on the six faces of a cube using perspective projection, which introduces very little distortion. Then, we concatenate all six faces while utilizing the connectivity between faces on the cube for image padding (i.e., Cube Padding) in convolution, pooling, and convolutional LSTM layers. In this way, CP introduces no image boundary while being applicable to almost all Convolutional Neural Network (CNN) structures. To evaluate our method, we propose Wild-360, a new 360° video saliency dataset containing challenging videos with saliency heatmap annotations. In experiments, our method outperforms baseline methods in both speed and quality.
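Cube Padding can be illustrated on a single shared edge: the sketch below pads the right border of a "front" face with columns taken from an adjacent "right" face, under an assumed face orientation; the full operator applies the same idea, with the appropriate rotations, to all twelve cube edges.

```python
import numpy as np

def pad_front_with_right(front, right, pad=2):
    """Toy cube padding for one edge: extend the front face on its right side
    with the leftmost columns of the right face, instead of zero padding.
    Faces are (H, W, C) arrays assumed to share that edge with consistent
    orientation; the real operator handles all 12 edges."""
    border = right[:, :pad, :]                # columns adjacent to the shared edge
    return np.concatenate([front, border], axis=1)

front = np.zeros((4, 4, 3))
right = np.ones((4, 4, 3))
padded = pad_front_with_right(front, right, pad=2)
print(padded.shape)   # (4, 6, 3): padding comes from the neighboring face, so no artificial boundary
```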
{"title":"Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos","authors":"Hsien-Tzu Cheng, Chun-Hung Chao, Jin-Dong Dong, Hao-Kai Wen, Tyng-Luh Liu, Min Sun","doi":"10.1109/CVPR.2018.00154","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00154","url":null,"abstract":"Automatic saliency prediction in 360° videos is critical for viewpoint guidance applications (e.g., Facebook 360 Guide). We propose a spatial-temporal network which is (1) weakly-supervised trained and (2) tailor-made for 360° viewing sphere. Note that most existing methods are less scalable since they rely on annotated saliency map for training. Most importantly, they convert 360° sphere to 2D images (e.g., a single equirectangular image or multiple separate Normal Field-of-View (NFoV) images) which introduces distortion and image boundaries. In contrast, we propose a simple and effective Cube Padding (CP) technique as follows. Firstly, we render the 360° view on six faces of a cube using perspective projection. Thus, it introduces very little distortion. Then, we concatenate all six faces while utilizing the connectivity between faces on the cube for image padding (i.e., Cube Padding) in convolution, pooling, convolutional LSTM layers. In this way, CP introduces no image boundary while being applicable to almost all Convolutional Neural Network (CNN) structures. To evaluate our method, we propose Wild-360, a new 360° video saliency dataset, containing challenging videos with saliency heatmap annotations. In experiments, our method outperforms baseline methods in both speed and quality.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"111 1","pages":"1420-1429"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80678844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 149
LiDAR-Video Driving Dataset: Learning Driving Policies Effectively
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00615
Yiping Chen, Jingkang Wang, Jonathan Li, Cewu Lu, Zhipeng Luo, Han Xue, Cheng Wang
Learning autonomous-driving policies is one of the most challenging but promising tasks for computer vision. Most researchers believe that future research and applications should combine cameras, video recorders, and laser scanners to obtain a comprehensive semantic understanding of real traffic. However, current approaches only learn from large-scale videos, due to the lack of benchmarks that consist of precise laser-scanner data. In this paper, we are the first to propose a LiDAR-Video dataset, which provides large-scale, high-quality point clouds scanned by a Velodyne laser, videos recorded by a dashboard camera, and standard drivers' behaviors. Extensive experiments demonstrate that the extra depth information indeed helps networks determine driving policies.
{"title":"LiDAR-Video Driving Dataset: Learning Driving Policies Effectively","authors":"Yiping Chen, Jingkang Wang, Jonathan Li, Cewu Lu, Zhipeng Luo, Han Xue, Cheng Wang","doi":"10.1109/CVPR.2018.00615","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00615","url":null,"abstract":"Learning autonomous-driving policies is one of the most challenging but promising tasks for computer vision. Most researchers believe that future research and applications should combine cameras, video recorders and laser scanners to obtain comprehensive semantic understanding of real traffic. However, current approaches only learn from large-scale videos, due to the lack of benchmarks that consist of precise laser-scanner data. In this paper, we are the first to propose a LiDAR-Video dataset, which provides large-scale high-quality point clouds scanned by a Velodyne laser, videos recorded by a dashboard camera and standard drivers' behaviors. Extensive experiments demonstrate that extra depth information help networks to determine driving policies indeed.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"358 1","pages":"5870-5878"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76358036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 102
A Robust Method for Strong Rolling Shutter Effects Correction Using Lines with Automatic Feature Selection
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00504
Yizhen Lao, Omar Ait-Aider
We present a robust method that compensates RS distortions in a single image using a set of image curves, based on the knowledge that they correspond to 3D straight lines. Unlike existing work, no a priori knowledge about the line directions (e.g., the Manhattan World assumption) is required. We first formulate a parametric equation for the projection of a 3D straight line viewed by a moving rolling shutter camera under a uniform motion model. We then propose a method that efficiently estimates the ego angular velocity separately from the pose parameters, using at least 4 image curves. Moreover, we propose, for the first time, a RANSAC-like strategy to select image curves that really correspond to 3D straight lines and reject those corresponding to actual curves in the 3D world. A comparative experimental study with both synthetic and real data from well-known benchmarks shows that the proposed method outperforms existing state-of-the-art techniques.
{"title":"A Robust Method for Strong Rolling Shutter Effects Correction Using Lines with Automatic Feature Selection","authors":"Yizhen Lao, Omar Ait-Aider","doi":"10.1109/CVPR.2018.00504","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00504","url":null,"abstract":"We present a robust method which compensates RS distortions in a single image using a set of image curves, basing on the knowledge that they correspond to 3D straight lines. Unlike in existing work, no a priori knowledge about the line directions (e.g. Manhattan World assumption) is required. We first formulate a parametric equation for the projection of a 3D straight line viewed by a moving rolling shutter camera under a uniform motion model. Then we propose a method which efficiently estimates ego angular velocity separately from pose parameters, using at least 4 image curves. Moreover, we propose for the first time a RANSAC-like strategy to select image curves which really correspond to 3D straight lines and reject those corresponding to actual curves in 3D world. A comparative experimental study with both synthetic and real data from famous benchmarks shows that the proposed method outperforms all the existing techniques from the state-of-the-art.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"50 1","pages":"4795-4803"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90838034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 39
Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00107
Junhyug Noh, Soochan Lee, Beomsu Kim, Gunhee Kim
We propose methods of addressing two critical issues of pedestrian detection: (i) occlusion of target objects, which causes false negatives, and (ii) confusion with hard negative examples such as vertical structures, which causes false positives. Our solutions to these two problems are general and flexible enough to be applicable to any single-stage detection model. We implement our methods in four state-of-the-art single-stage models, including SqueezeDet+ [22], YOLOv2 [17], SSD [12], and DSSD [8]. We empirically validate that our approach indeed improves the performance of those four models on the Caltech Pedestrian [4] and CityPersons [25] datasets. Moreover, in some heavy occlusion settings, our approach achieves the best reported performance. Specifically, our two solutions are as follows. For better occlusion handling, we update the output tensors of single-stage models so that they include the prediction of part confidence scores, from which we compute a final occlusion-aware detection score. For reducing confusion with hard negative examples, we introduce average grid classifiers as post-refinement classifiers, trainable in an end-to-end fashion with little memory and time overhead (e.g., an increase of 1-5 MB in memory and 1-2 ms in inference time).
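One plausible reading of the occlusion-aware score is sketched below: the box-level classification score is modulated by a grid of part confidences, so that boxes with invisible parts are demoted rather than rejected; the exact fusion used in the paper may differ.

```python
import numpy as np

def occlusion_aware_score(class_score, part_confidences, visibility_floor=0.2):
    """Combine a box-level classification score with a grid of part confidence
    scores (e.g., 3 x 3 over the pedestrian box). Occluded parts contribute
    little, but a floor keeps heavily occluded yet confident boxes alive."""
    part_term = np.clip(part_confidences, visibility_floor, 1.0).mean()
    return class_score * part_term

parts_visible = np.array([[0.9, 0.9, 0.8],
                          [0.9, 0.9, 0.8],
                          [0.8, 0.9, 0.9]])
parts_occluded = np.array([[0.9, 0.9, 0.9],
                           [0.9, 0.8, 0.9],
                           [0.1, 0.0, 0.1]])   # lower body hidden by an obstacle
print(occlusion_aware_score(0.8, parts_visible))    # ~0.69
print(occlusion_aware_score(0.8, parts_occluded))   # ~0.52, demoted but not rejected
```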
{"title":"Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors","authors":"Junhyug Noh, Soochan Lee, Beomsu Kim, Gunhee Kim","doi":"10.1109/CVPR.2018.00107","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00107","url":null,"abstract":"We propose methods of addressing two critical issues of pedestrian detection: (i) occlusion of target objects as false negative failure, and (ii) confusion with hard negative examples like vertical structures as false positive failure. Our solutions to these two problems are general and flexible enough to be applicable to any single-stage detection models. We implement our methods into four state-of-the-art single-stage models, including SqueezeDet+ [22], YOLOv2 [17], SSD [12], and DSSD [8]. We empirically validate that our approach indeed improves the performance of those four models on Caltech pedestrian [4] and CityPersons dataset [25]. Moreover, in some heavy occlusion settings, our approach achieves the best reported performance. Specifically, our two solutions are as follows. For better occlusion handling, we update the output tensors of single-stage models so that they include the prediction of part confidence scores, from which we compute a final occlusion-aware detection score. For reducing confusion with hard negative examples, we introduce average grid classifiers as post-refinement classifiers, trainable in an end-to-end fashion with little memory and time overhead (e.g. increase of 1-5 MB in memory and 1-2 ms in inference time).","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"10 1","pages":"966-974"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90354310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 76
Recognizing Human Actions as the Evolution of Pose Estimation Maps
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00127
Mengyuan Liu, Junsong Yuan
Most video-based action recognition approaches extract features from the whole video to recognize actions. Cluttered backgrounds and non-action motions limit the performance of these methods, since they lack explicit modeling of human body movements. Building on recent advances in human pose estimation, this work presents a novel method that recognizes human actions as the evolution of pose estimation maps. Instead of relying on the inaccurate human poses estimated from videos, we observe that pose estimation maps, the byproduct of pose estimation, preserve richer cues about the human body that benefit action recognition. Specifically, the evolution of pose estimation maps can be decomposed into an evolution of heatmaps (e.g., probabilistic maps) and an evolution of estimated 2D human poses, which capture the changes of body shape and body pose, respectively. Considering the sparsity of heatmaps, we develop spatial rank pooling to aggregate the evolution of heatmaps into a body shape evolution image. As the body shape evolution image does not differentiate body parts, we design body-guided sampling to aggregate the evolution of poses into a body pose evolution image. The complementary properties of the two types of images are explored by deep convolutional neural networks to predict the action label. Experiments on the NTU RGB+D, UTD-MHAD, and PennAction datasets verify the effectiveness of our method, which outperforms most state-of-the-art methods.
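Rank pooling, which the paper adapts into spatial rank pooling, can be sketched as fitting a linear model whose scores increase with frame order and keeping its weight vector as the aggregated evolution image; the sketch below applies standard rank pooling to flattened heatmaps and omits the paper's spatial adaptation.

```python
import numpy as np
from sklearn.svm import LinearSVR

def rank_pool(heatmap_sequence):
    """Aggregate a sequence of pose-estimation heatmaps (T, H, W) into a single
    evolution image: fit a linear model whose scores increase with time and
    reshape its weight vector back to image size."""
    T, H, W = heatmap_sequence.shape
    frames = heatmap_sequence.reshape(T, H * W)
    # Cumulative-mean smoothing, commonly applied before rank pooling.
    frames = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    targets = np.arange(1, T + 1, dtype=float)       # temporal order to regress
    svr = LinearSVR(C=1.0, fit_intercept=True, max_iter=10000)
    svr.fit(frames, targets)
    return svr.coef_.reshape(H, W)                   # the evolution image

rng = np.random.default_rng(0)
sequence = rng.random((16, 32, 32))                  # 16 frames of 32x32 heatmaps
print(rank_pool(sequence).shape)                     # (32, 32)
```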
{"title":"Recognizing Human Actions as the Evolution of Pose Estimation Maps","authors":"Mengyuan Liu, Junsong Yuan","doi":"10.1109/CVPR.2018.00127","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00127","url":null,"abstract":"Most video-based action recognition approaches choose to extract features from the whole video to recognize actions. The cluttered background and non-action motions limit the performances of these methods, since they lack the explicit modeling of human body movements. With recent advances of human pose estimation, this work presents a novel method to recognize human action as the evolution of pose estimation maps. Instead of relying on the inaccurate human poses estimated from videos, we observe that pose estimation maps, the byproduct of pose estimation, preserve richer cues of human body to benefit action recognition. Specifically, the evolution of pose estimation maps can be decomposed as an evolution of heatmaps, e.g., probabilistic maps, and an evolution of estimated 2D human poses, which denote the changes of body shape and body pose, respectively. Considering the sparse property of heatmap, we develop spatial rank pooling to aggregate the evolution of heatmaps as a body shape evolution image. As body shape evolution image does not differentiate body parts, we design body guided sampling to aggregate the evolution of poses as a body pose evolution image. The complementary properties between both types of images are explored by deep convolutional neural networks to predict action label. Experiments on NTU RGB+D, UTD-MHAD and PennAction datasets verify the effectiveness of our method, which outperforms most state-of-the-art methods.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"11 1","pages":"1159-1168"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89947080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 244
Deep Parametric Continuous Convolutional Neural Networks
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00274
Shenlong Wang, Simon Suo, Wei-Chiu Ma, A. Pokrovsky, R. Urtasun
Standard convolutional neural networks assume a grid-structured input is available and exploit discrete convolutions as their fundamental building blocks. This limits their applicability to many real-world applications. In this paper we propose Parametric Continuous Convolution, a new learnable operator that operates over non-grid-structured data. The key idea is to exploit parameterized kernel functions that span the full continuous vector space. This generalization allows us to learn over arbitrary data structures as long as their support relationship is computable. Our experiments show significant improvement over the state-of-the-art in point cloud segmentation of indoor and outdoor scenes, and in lidar motion estimation of driving scenes.
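The operator can be summarized as h_i = Σ_{j ∈ N(i)} g(y_j − x_i; θ) ⊙ f_j, where g is a small MLP evaluated at continuous offsets; the NumPy sketch below uses a randomly initialized kernel MLP and brute-force nearest neighbours, and is an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def kernel_mlp(offsets, w1, b1, w2, b2):
    """Tiny MLP g(.; theta): maps 3D offsets to per-channel kernel weights."""
    hidden = np.maximum(offsets @ w1 + b1, 0.0)           # ReLU layer
    return hidden @ w2 + b2                               # (num_neighbors, channels)

def continuous_conv(points, features, num_neighbors, params):
    """For every point, weight the features of its nearest neighbours by a
    kernel evaluated at the continuous offsets, then sum."""
    w1, b1, w2, b2 = params
    output = np.zeros_like(features)
    for i, x in enumerate(points):
        offsets = points - x                               # (N, 3) continuous offsets
        neighbors = np.argsort(np.linalg.norm(offsets, axis=1))[:num_neighbors]
        weights = kernel_mlp(offsets[neighbors], w1, b1, w2, b2)
        output[i] = (weights * features[neighbors]).sum(axis=0)
    return output

rng = np.random.default_rng(0)
points = rng.random((100, 3))                              # unordered 3D point cloud
features = rng.random((100, 16))                           # 16-dim feature per point
params = (rng.normal(size=(3, 32)), np.zeros(32),
          rng.normal(size=(32, 16)), np.zeros(16))
print(continuous_conv(points, features, num_neighbors=8, params=params).shape)  # (100, 16)
```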
{"title":"Deep Parametric Continuous Convolutional Neural Networks","authors":"Shenlong Wang, Simon Suo, Wei-Chiu Ma, A. Pokrovsky, R. Urtasun","doi":"10.1109/CVPR.2018.00274","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00274","url":null,"abstract":"Standard convolutional neural networks assume a grid structured input is available and exploit discrete convolutions as their fundamental building blocks. This limits their applicability to many real-world applications. In this paper we propose Parametric Continuous Convolution, a new learnable operator that operates over non-grid structured data. The key idea is to exploit parameterized kernel functions that span the full continuous vector space. This generalization allows us to learn over arbitrary data structures as long as their support relationship is computable. Our experiments show significant improvement over the state-of-the-art in point cloud segmentation of indoor and outdoor scenes, and lidar motion estimation of driving scenes.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"44 1","pages":"2589-2597"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87803998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 390
Focus Manipulation Detection via Photometric Histogram Analysis
Pub Date : 2018-06-01 DOI: 10.1109/CVPR.2018.00180
Can Chen, Scott McCloskey, Jingyi Yu
With the rise of misinformation spread via social media channels, enabled by the increasing automation and realism of image manipulation tools, image forensics is an increasingly relevant problem. Classic image forensic methods leverage low-level cues such as metadata, sensor noise fingerprints, and others that are easily fooled when the image is re-encoded upon upload to Facebook and similar platforms. This necessitates the use of higher-level physical and semantic cues that, once hard to estimate reliably in the wild, have become more effective due to the increasing power of computer vision. In particular, we detect manipulations introduced by artificial blurring of the image, which creates inconsistent photometric relationships between image intensity and various cues. We achieve 98% accuracy on the most challenging cases in a new dataset of blur manipulations, where the blur is geometrically correct and consistent with the scene's physical arrangement. Such manipulations are now easily generated, for instance, by smartphone cameras with hardware to measure depth, e.g., the 'Portrait Mode' of the iPhone 7 Plus. We also demonstrate good performance on a challenge dataset evaluating a wider range of manipulations in imagery representing 'in the wild' conditions.
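As a crude stand-in for the photometric analysis, the sketch below histograms patch-wise sharpness (mean absolute Laplacian response) into a feature vector that a downstream classifier could consume; the paper's actual photometric cues and their relationships to image intensity are richer than this.

```python
import numpy as np

def sharpness_histogram(gray, patch=16, bins=8):
    """Histogram of per-patch sharpness (mean absolute Laplacian response),
    a crude photometric feature; inconsistent blur tends to reshape it."""
    lap = (4 * gray[1:-1, 1:-1]
           - gray[:-2, 1:-1] - gray[2:, 1:-1]
           - gray[1:-1, :-2] - gray[1:-1, 2:])             # 3x3 Laplacian, valid region
    h, w = lap.shape
    scores = [np.abs(lap[i:i + patch, j:j + patch]).mean()
              for i in range(0, h - patch + 1, patch)
              for j in range(0, w - patch + 1, patch)]
    hist, _ = np.histogram(scores, bins=bins, range=(0.0, 2.0))
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
sharp = rng.random((128, 128))             # noise-like image with high local contrast
blurred = np.ones((128, 128)) * 0.5        # flat image, as if heavily blurred
print(sharpness_histogram(sharp))          # mass lands above the lowest bin
print(sharpness_histogram(blurred))        # all mass in the lowest (blurry) bin
```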
{"title":"Focus Manipulation Detection via Photometric Histogram Analysis","authors":"Can Chen, Scott McCloskey, Jingyi Yu","doi":"10.1109/CVPR.2018.00180","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00180","url":null,"abstract":"With the rise of misinformation spread via social media channels, enabled by the increasing automation and realism of image manipulation tools, image forensics is an increasingly relevant problem. Classic image forensic methods leverage low-level cues such as metadata, sensor noise fingerprints, and others that are easily fooled when the image is re-encoded upon upload to facebook, etc. This necessitates the use of higher-level physical and semantic cues that, once hard to estimate reliably in the wild, have become more effective due to the increasing power of computer vision. In particular, we detect manipulations introduced by artificial blurring of the image, which creates inconsistent photometric relationships between image intensity and various cues. We achieve 98% accuracy on the most challenging cases in a new dataset of blur manipulations, where the blur is geometrically correct and consistent with the scene's physical arrangement. Such manipulations are now easily generated, for instance, by smartphone cameras having hardware to measure depth, e.g. 'Portrait Mode' of the iPhone7Plus. We also demonstrate good performance on a challenge dataset evaluating a wider range of manipulations in imagery representing 'in the wild' conditions.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"95 1","pages":"1674-1682"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85718898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9