VILP: Imitation Learning With Latent Video Planning

IEEE Robotics and Automation Letters · Impact Factor 5.3 · CAS Region 2 (Computer Science) · JCR Q2 (Robotics) · Vol. 10, No. 4, pp. 3350-3357 · Published: 2025-02-14 · DOI: 10.1109/LRA.2025.3542317
Zhengtong Xu, Qiang Qiu, Yu She
{"title":"VILP: Imitation Learning With Latent Video Planning","authors":"Zhengtong Xu;Qiang Qiu;Yu She","doi":"10.1109/LRA.2025.3542317","DOIUrl":null,"url":null,"abstract":"In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This letter introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96 × 160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compared our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 4","pages":"3350-3357"},"PeriodicalIF":5.3000,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10887293/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0

Abstract

In the era of generative AI, integrating video generation models into robotics opens new possibilities for general-purpose robot agents. This letter introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model that generates predictive robot videos with a good degree of temporal consistency. Our method can generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. The video generation model is also highly time-efficient: for example, it can generate videos from two distinct perspectives, each consisting of six frames at a resolution of 96 × 160 pixels, at a rate of 5 Hz. In our experiments, we demonstrate that VILP outperforms an existing video-generation robot policy across several metrics: training cost, inference speed, temporal consistency of the generated videos, and policy performance. We also compare our method with other imitation learning methods; our findings indicate that VILP relies less on extensive high-quality, task-specific robot action data while still maintaining robust performance. In addition, VILP has strong capabilities for representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions.
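To make the pipeline described in the abstract concrete, below is a minimal, hypothetical sketch of one planning cycle of a latent-video-planning policy: encode the current multi-view observation, iteratively denoise a latent video plan conditioned on it, decode the plan into predicted video, and read an action chunk off the plan. Only the figures stated in the abstract (two views, six frames, 96 × 160 resolution, 5 Hz) come from the paper; every name, the latent size, the step count, and the update rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a latent-video-planning policy cycle in the spirit of VILP.
# All functions are stand-ins (toy math), not the paper's actual networks.
import numpy as np

N_VIEWS, N_FRAMES = 2, 6         # two camera views, six future frames (from the abstract)
H, W = 96, 160                   # 96 x 160 pixel frames (from the abstract)
LATENT_DIM = 64                  # latent size: an assumption for illustration
DENOISE_STEPS = 10               # few-step sampling keeps planning fast (assumed count)

rng = np.random.default_rng(0)

def encode(obs_frames: np.ndarray) -> np.ndarray:
    """Stand-in VAE encoder: map (views, H, W, 3) images to per-view latents."""
    return rng.standard_normal((N_VIEWS, LATENT_DIM))

def denoise_step(z_plan: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for one diffusion denoising step; conditioning every view on the
    same encoded observation is what keeps the views time-aligned."""
    return 0.9 * z_plan + 0.1 * cond[:, None, :]

def decode(z_plan: np.ndarray) -> np.ndarray:
    """Stand-in VAE decoder: latent plan back to (views, frames, H, W, 3) video."""
    return np.zeros((N_VIEWS, N_FRAMES, H, W, 3))

def actions_from_plan(z_plan: np.ndarray) -> np.ndarray:
    """Stand-in action head: read a short action chunk off the latent plan."""
    return z_plan.mean(axis=(0, 2))  # one toy action value per planned frame

# One planning cycle; at 5 Hz everything below must fit in a 0.2 s budget.
obs = np.zeros((N_VIEWS, H, W, 3))                             # current camera images
cond = encode(obs)                                             # observation latents
z_plan = rng.standard_normal((N_VIEWS, N_FRAMES, LATENT_DIM))  # start from noise
for t in reversed(range(DENOISE_STEPS)):
    z_plan = denoise_step(z_plan, cond, t)
video_plan = decode(z_plan)                                    # predictive robot video, both views
action_chunk = actions_from_plan(z_plan)
print(video_plan.shape, action_chunk.shape)                    # (2, 6, 96, 160, 3) (6,)
```

The 5 Hz figure implies the full encode-denoise-decode loop must finish within 0.2 s, which is why planning in a compact latent space with few denoising steps, rather than in pixel space, matters for the reported time efficiency.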
Source journal
IEEE Robotics and Automation Letters (Computer Science: Computer Science Applications)
CiteScore: 9.60
Self-citation rate: 15.40%
Articles per year: 1428
Journal description: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.
Latest articles from this journal
- A Valve-Less Electro-Hydrostatic Powered Prosthetic Foot to Improve the Power Efficiency During Walking
- Deep Learning-Based Fourier Registration for Forward-Looking Sonar Odometry in Texture-Sparse Underwater Environments
- Towards Quadrupedal Jumping and Walking for Dynamic Locomotion Using Reinforcement Learning
- Sim2Real Domain Shifting: Hyper-Realistic Data Generation for Object Segmentation
- IEEE Robotics and Automation Letters Information for Authors