VILP: Imitation Learning With Latent Video Planning

IEEE Robotics and Automation Letters · Impact Factor 5.3 · CAS Region 2 (Computer Science) · JCR Q2 (Robotics) · Vol. 10, No. 4, pp. 3350-3357 · Published: 2025-02-14 · DOI: 10.1109/LRA.2025.3542317
Zhengtong Xu, Qiang Qiu, Yu She
{"title":"VILP: Imitation Learning With Latent Video Planning","authors":"Zhengtong Xu;Qiang Qiu;Yu She","doi":"10.1109/LRA.2025.3542317","DOIUrl":null,"url":null,"abstract":"In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This letter introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96 × 160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compared our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 4","pages":"3350-3357"},"PeriodicalIF":5.3000,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10887293/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0

Abstract

In the era of generative AI, integrating video generation models into robotics opens new possibilities for general-purpose robot agents. This letter introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model that generates predictive robot videos with a good degree of temporal consistency. Our method can generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. The video generation model is also highly time-efficient: for example, it can generate videos from two distinct perspectives, each consisting of six frames at a resolution of 96 × 160 pixels, at a rate of 5 Hz. In our experiments, we demonstrate that VILP outperforms an existing video-generation robot policy across several metrics: training cost, inference speed, temporal consistency of the generated videos, and policy performance. We also compare our method with other imitation learning methods; our findings indicate that VILP relies less on extensive high-quality, task-specific robot action data while still maintaining robust performance. In addition, VILP has strong capabilities for representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions.
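To make the pipeline described in the abstract concrete, below is a minimal, hypothetical sketch of one planning cycle of a latent-video-planning policy: encode the current multi-view observation, iteratively denoise a latent video plan conditioned on it, decode the plan into predicted video, and read an action chunk off the plan. Only the figures stated in the abstract (two views, six frames, 96 × 160 resolution, 5 Hz) come from the paper; every name, the latent size, the step count, and the update rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a latent-video-planning policy cycle in the spirit of VILP.
# All functions are stand-ins (toy math), not the paper's actual networks.
import numpy as np

N_VIEWS, N_FRAMES = 2, 6         # two camera views, six future frames (from the abstract)
H, W = 96, 160                   # 96 x 160 pixel frames (from the abstract)
LATENT_DIM = 64                  # latent size: an assumption for illustration
DENOISE_STEPS = 10               # few-step sampling keeps planning fast (assumed count)

rng = np.random.default_rng(0)

def encode(obs_frames: np.ndarray) -> np.ndarray:
    """Stand-in VAE encoder: map (views, H, W, 3) images to per-view latents."""
    return rng.standard_normal((N_VIEWS, LATENT_DIM))

def denoise_step(z_plan: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for one diffusion denoising step; conditioning every view on the
    same encoded observation is what keeps the views time-aligned."""
    return 0.9 * z_plan + 0.1 * cond[:, None, :]

def decode(z_plan: np.ndarray) -> np.ndarray:
    """Stand-in VAE decoder: latent plan back to (views, frames, H, W, 3) video."""
    return np.zeros((N_VIEWS, N_FRAMES, H, W, 3))

def actions_from_plan(z_plan: np.ndarray) -> np.ndarray:
    """Stand-in action head: read a short action chunk off the latent plan."""
    return z_plan.mean(axis=(0, 2))  # one toy action value per planned frame

# One planning cycle; at 5 Hz everything below must fit in a 0.2 s budget.
obs = np.zeros((N_VIEWS, H, W, 3))                             # current camera images
cond = encode(obs)                                             # observation latents
z_plan = rng.standard_normal((N_VIEWS, N_FRAMES, LATENT_DIM))  # start from noise
for t in reversed(range(DENOISE_STEPS)):
    z_plan = denoise_step(z_plan, cond, t)
video_plan = decode(z_plan)                                    # predictive robot video, both views
action_chunk = actions_from_plan(z_plan)
print(video_plan.shape, action_chunk.shape)                    # (2, 6, 96, 160, 3) (6,)
```

The 5 Hz figure implies the full encode-denoise-decode loop must finish within 0.2 s, which is why planning in a compact latent space with few denoising steps, rather than in pixel space, matters for the reported time efficiency.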
Source journal
IEEE Robotics and Automation Letters (Computer Science: Computer Science Applications)
CiteScore: 9.60
Self-citation rate: 15.40%
Articles per year: 1428
Journal description: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.
Latest articles from this journal
- A Valve-Less Electro-Hydrostatic Powered Prosthetic Foot to Improve the Power Efficiency During Walking
- Deep Learning-Based Fourier Registration for Forward-Looking Sonar Odometry in Texture-Sparse Underwater Environments
- Towards Quadrupedal Jumping and Walking for Dynamic Locomotion Using Reinforcement Learning
- Sim2Real Domain Shifting: Hyper-Realistic Data Generation for Object Segmentation
- IEEE Robotics and Automation Letters Information for Authors