Slim DensePose: Thrifty Learning From Sparse Annotations and Motion Cues

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pub Date: 2019-06-01
DOI: 10.1109/CVPR.2019.01117

N. Neverova, James Thewlis, R. Güler, Iasonas Kokkinos, A. Vedaldi

Pages: 10907-10915

Citations: 29

Abstract

DensePose supersedes traditional landmark detectors by densely mapping image pixels to body surface coordinates. This power, however, comes at a greatly increased annotation cost, as supervising the model requires manually labeling hundreds of points per pose instance. In this work, we thus seek methods to significantly slim down the DensePose annotations, proposing more efficient data collection strategies. In particular, we demonstrate that if annotations are collected in video frames, their efficacy can be multiplied for free by using motion cues. To explore this idea, we introduce DensePose-Track, a dataset of videos where selected frames are annotated in the traditional DensePose manner. Then, building on geometric properties of the DensePose mapping, we use video dynamics to propagate ground-truth annotations in time as well as to learn from Siamese equivariance constraints. Having performed an exhaustive empirical evaluation of various data annotation and learning strategies, we demonstrate that doing so can deliver significantly improved pose estimation results over strong baselines. However, despite what is suggested by some recent works, we show that merely synthesizing motion patterns by applying geometric transformations to isolated frames is significantly less effective, and that motion cues help much more when they are extracted from videos.
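The two motion-based ideas named in the abstract, propagating annotations in time and an equivariance constraint between frames, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a dense flow field is already available as a nested list `flow[y][x] = (dx, dy)`, and the names `propagate_annotations` and `equivariance_loss` are hypothetical, as is the nearest-pixel lookup and the squared-error form of the loss.

```python
from typing import List, Tuple

Point = Tuple[float, float]            # (x, y) pixel location
Label = Tuple[int, float, float]       # (body part index, u, v); assumed label layout
Flow = List[List[Tuple[float, float]]]  # flow[y][x] = (dx, dy); assumed flow layout
UVMap = List[List[Tuple[float, float]]]  # pred[y][x] = (u, v); assumed prediction layout


def propagate_annotations(points: List[Point], labels: List[Label],
                          flow: Flow) -> Tuple[List[Point], List[Label]]:
    """Move annotated points along the flow; surface labels travel unchanged."""
    height, width = len(flow), len(flow[0])
    out_points, out_labels = [], []
    for (x, y), label in zip(points, labels):
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= xi < width and 0 <= yi < height):
            continue  # annotation lies outside the flow field
        dx, dy = flow[yi][xi]
        nx, ny = x + dx, y + dy
        if 0 <= nx <= width - 1 and 0 <= ny <= height - 1:
            out_points.append((nx, ny))
            out_labels.append(label)  # same body-surface point, new pixel
    return out_points, out_labels


def equivariance_loss(pred_a: UVMap, pred_b: UVMap, flow: Flow) -> float:
    """Mean squared gap between pred_a at p and pred_b at p + flow(p)."""
    height, width = len(flow), len(flow[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            tx, ty = int(round(x + dx)), int(round(y + dy))
            if 0 <= tx < width and 0 <= ty < height:
                ua, va = pred_a[y][x]
                ub, vb = pred_b[ty][tx]
                total += (ua - ub) ** 2 + (va - vb) ** 2
                count += 1
    return total / count if count else 0.0


# Toy example: every pixel moves one step to the right between frames.
flow = [[(1.0, 0.0)] * 3 for _ in range(2)]
points = [(0.0, 0.0), (2.0, 1.0)]
labels = [(1, 0.2, 0.3), (2, 0.5, 0.6)]
moved_points, moved_labels = propagate_annotations(points, labels, flow)
```

In this toy case the second point leaves the image and is dropped, so only the first annotation reaches the next frame. In practice the flow would be estimated from real video, which the abstract reports to be more effective than motion synthesized by geometric transformations of isolated frames.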


