基于蒙面视觉预训练的真实世界机器人学习

Conference on Robot Learning Pub Date : 2022-10-06 DOI:10.48550/arXiv.2210.03109

Ilija Radosavovic, Tete Xiao, Stephen James, P. Abbeel, J. Malik, Trevor Darrell

{"title":"基于蒙面视觉预训练的真实世界机器人学习","authors":"Ilija Radosavovic, Tete Xiao, Stephen James, P. Abbeel, J. Malik, Trevor Darrell","doi":"10.48550/arXiv.2210.03109","DOIUrl":null,"url":null,"abstract":"In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.","PeriodicalId":273870,"journal":{"name":"Conference on Robot Learning","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"95","resultStr":"{\"title\":\"Real-World Robot Learning with Masked Visual Pre-training\",\"authors\":\"Ilija Radosavovic, Tete Xiao, Stephen James, P. Abbeel, J. Malik, Trevor Darrell\",\"doi\":\"10.48550/arXiv.2210.03109\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.\",\"PeriodicalId\":273870,\"journal\":{\"name\":\"Conference on Robot Learning\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"95\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Conference on Robot Learning\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2210.03109\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Conference on Robot Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.03109","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 95

摘要

在这项工作中，我们探索了来自现实世界机器人任务的各种野外视频图像的自监督视觉预训练。与之前的工作一样，我们的视觉表征通过掩模自动编码器(MAE)进行预训练，冻结，然后传递到可学习的控制模块。与之前的工作不同，我们表明预训练的表征在一系列现实世界的机器人任务和实施例中是有效的。我们发现我们的编码器始终优于CLIP(高达75%)，监督ImageNet预训练(高达81%)和从头开始训练(高达81%)。最后，我们在来自互联网和自我中心视频的450万图像的大量集合上训练了一个307M参数的视觉转换器，并清楚地展示了缩放视觉预训练对机器人学习的好处。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Real-World Robot Learning with Masked Visual Pre-training

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Conference on Robot Learning

自引率

0.00%

发文量

期刊最新文献

MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models Lidar Line Selection with Spatially-Aware Shapley Value for Cost-Efficient Depth Completion Safe Robot Learning in Assistive Devices through Neural Network Repair COACH: Cooperative Robot Teaching Learning Goal-Conditioned Policies Offline with Self-Supervised Reward Shaping