SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?

John McCormac, Ankur Handa, Stefan Leutenegger, Andrew J. Davison
{"title":"SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?","authors":"J. McCormac, Ankur Handa, Stefan Leutenegger, A. Davison","doi":"10.1109/ICCV.2017.292","DOIUrl":null,"url":null,"abstract":"We introduce SceneNet RGB-D, a dataset providing pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, and object detection. It also provides perfect camera poses and depth data, allowing investigation into geometric computer vision problems such as optical flow, camera pose estimation, and 3D scene labelling tasks. Random sampling permits virtually unlimited scene configurations, and here we provide 5M rendered RGB-D images from 16K randomly generated 3D trajectories in synthetic layouts, with random but physically simulated object configurations. We compare the semantic segmentation performance of network weights produced from pretraining on RGB images from our dataset against generic VGG-16 ImageNet weights. After fine-tuning on the SUN RGB-D and NYUv2 real-world datasets we find in both cases that the synthetically pre-trained network outperforms the VGG-16 weights. When synthetic pre-training includes a depth channel (something ImageNet cannot natively provide) the performance is greater still. This suggests that large-scale high-quality synthetic RGB datasets with task-specific labels can be more useful for pretraining than real-world generic pre-training such as ImageNet. We host the dataset at http://robotvault. bitbucket.io/scenenet-rgbd.html.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"60 1","pages":"2697-2706"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"263","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV.2017.292","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 263

Abstract

We introduce SceneNet RGB-D, a dataset providing pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, and object detection. It also provides perfect camera poses and depth data, allowing investigation into geometric computer vision problems such as optical flow, camera pose estimation, and 3D scene labelling tasks. Random sampling permits virtually unlimited scene configurations, and here we provide 5M rendered RGB-D images from 16K randomly generated 3D trajectories in synthetic layouts, with random but physically simulated object configurations. We compare the semantic segmentation performance of network weights produced from pre-training on RGB images from our dataset against generic VGG-16 ImageNet weights. After fine-tuning on the SUN RGB-D and NYUv2 real-world datasets we find in both cases that the synthetically pre-trained network outperforms the VGG-16 weights. When synthetic pre-training includes a depth channel (something ImageNet cannot natively provide) the performance is greater still. This suggests that large-scale high-quality synthetic RGB datasets with task-specific labels can be more useful for pre-training than real-world generic pre-training such as ImageNet. We host the dataset at http://robotvault.bitbucket.io/scenenet-rgbd.html.
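To make the depth-channel comparison concrete, below is a minimal sketch (not the authors' released code) of one common way to widen an ImageNet VGG-16 encoder so it accepts a fourth depth channel before fine-tuning on SUN RGB-D or NYUv2. The PyTorch/torchvision API and the mean-filter initialisation heuristic are assumptions of this illustration; the paper's actual training setup may differ.

```python
# Sketch: adapt a 3-channel VGG-16 encoder to 4-channel RGB-D input.
# Assumes PyTorch and torchvision; hyperparameters are illustrative only.
import torch
import torch.nn as nn
from torchvision.models import vgg16

def make_rgbd_vgg16(use_imagenet_weights: bool = True) -> nn.Module:
    """Return a VGG-16 feature extractor whose first conv takes 4 channels.

    The RGB filters of conv1_1 are copied from the (optionally pretrained)
    3-channel model; the new depth channel is initialised with their mean,
    a common heuristic when widening an input layer.
    """
    base = vgg16(weights="IMAGENET1K_V1" if use_imagenet_weights else None)
    old_conv = base.features[0]  # Conv2d(3, 64, kernel_size=3, padding=1)
    new_conv = nn.Conv2d(4, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size,
                         padding=old_conv.padding)
    with torch.no_grad():
        new_conv.weight[:, :3] = old_conv.weight                          # reuse RGB filters
        new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)  # depth = mean of RGB
        new_conv.bias.copy_(old_conv.bias)
    base.features[0] = new_conv
    return base.features  # encoder only; attach a segmentation head for fine-tuning

# Usage: forward a batch of RGB-D frames (B, 4, H, W) through the encoder.
encoder = make_rgbd_vgg16()
feats = encoder(torch.randn(2, 4, 240, 320))
```

Mean-initialising the depth filters keeps the first layer's activation scale close to that of the pretrained RGB path, so fine-tuning starts from a sensible operating point rather than from random depth weights.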