PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

Hanlin Wang;Yilu Wu;Sheng Guo;Limin Wang
{"title":"PDPP: Projected Diffusion for Procedure Planning in Instructional Videos","authors":"Hanlin Wang;Yilu Wu;Sheng Guo;Limin Wang","doi":"10.1109/TPAMI.2024.3518762","DOIUrl":null,"url":null,"abstract":"In this paper, we study the problem of procedure planning in instructional videos, which aims to make a plan (i.e. a sequence of actions) given the current visual observation and the desired goal. Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision to make autoregressive planning, resulting in complex learning schemes and expensive annotation costs. To avoid intermediate supervision annotation and error accumulation caused by planning autoregressively, we propose a diffusion-based framework, coined as PDPP (Projected Diffusion model for Procedure Planning), to directly model the whole action sequence distribution with task label as supervision instead. Our core idea is to treat procedure planning as a distribution fitting problem under the given observations, thus transform the planning problem to a sampling process from this distribution during inference. The diffusion-based modeling approach also effectively addresses the uncertainty issue in procedure planning. Based on PDPP, we further apply joint training to our framework to generate plans with varying horizon lengths using a single model and reduce the number of training parameters required. We instantiate our PDPP with three popular diffusion models and investigate a serious of condition-introducing methods in our framework, including condition embeddings, Mixture-of-Experts (MoEs), two-stage prediction and Classifier-Free Guidance strategy. Finally, we apply our PDPP to the Visual Planners for human Assistance (VPA) problem which requires the goal specified in natural language rather than visual observation. We conduct experiments on challenging datasets of different scales and our PDPP model achieves the state-of-the-art performance on multiple metrics, even compared with those strongly-supervised counterparts. These results further demonstratethe effectiveness and generalization ability of our model.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2107-2124"},"PeriodicalIF":18.6000,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10804102/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In this paper, we study the problem of procedure planning in instructional videos, which aims to make a plan (i.e. a sequence of actions) given the current visual observation and the desired goal. Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision to make autoregressive planning, resulting in complex learning schemes and expensive annotation costs. To avoid intermediate supervision annotation and error accumulation caused by planning autoregressively, we propose a diffusion-based framework, coined as PDPP (Projected Diffusion model for Procedure Planning), to directly model the whole action sequence distribution with task label as supervision instead. Our core idea is to treat procedure planning as a distribution fitting problem under the given observations, thus transform the planning problem to a sampling process from this distribution during inference. The diffusion-based modeling approach also effectively addresses the uncertainty issue in procedure planning. Based on PDPP, we further apply joint training to our framework to generate plans with varying horizon lengths using a single model and reduce the number of training parameters required. We instantiate our PDPP with three popular diffusion models and investigate a serious of condition-introducing methods in our framework, including condition embeddings, Mixture-of-Experts (MoEs), two-stage prediction and Classifier-Free Guidance strategy. Finally, we apply our PDPP to the Visual Planners for human Assistance (VPA) problem which requires the goal specified in natural language rather than visual observation. We conduct experiments on challenging datasets of different scales and our PDPP model achieves the state-of-the-art performance on multiple metrics, even compared with those strongly-supervised counterparts. These results further demonstratethe effectiveness and generalization ability of our model.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
PDPP:教学视频中程序规划的预测扩散量
在本文中,我们研究了教学视频中的程序规划问题,其目的是根据当前的视觉观察和期望的目标制定一个计划(即一系列动作)。以前的工作将此作为序列建模问题,并利用中间视觉观察或语言指令作为监督来进行自回归规划,导致复杂的学习方案和昂贵的注释成本。为了避免计划自回归导致的中间监督注释和错误积累,我们提出了一个基于扩散的框架,称为PDPP (Projected Diffusion model for Procedure planning),以任务标签作为监督直接对整个动作序列分布进行建模。我们的核心思想是将程序规划视为给定观测值下的分布拟合问题,从而在推理过程中将规划问题从该分布转化为采样过程。基于扩散的建模方法也有效地解决了工艺规划中的不确定性问题。在PDPP的基础上,我们进一步将联合训练应用到我们的框架中,使用单个模型生成不同水平长度的计划,并减少所需训练参数的数量。我们用三种流行的扩散模型实例化了我们的PDPP,并在我们的框架中研究了一系列引入条件的方法,包括条件嵌入、专家混合(MoEs)、两阶段预测和无分类器指导策略。最后,我们将PDPP应用于人类辅助视觉规划者(Visual Planners for human Assistance, VPA)问题,该问题需要以自然语言指定目标,而不是视觉观察。我们在不同规模的具有挑战性的数据集上进行实验,我们的PDPP模型在多个指标上达到了最先进的性能,甚至与那些强监督的模型相比也是如此。这些结果进一步证明了该模型的有效性和泛化能力。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
GrowSP++: Growing Superpoints and Primitives for Unsupervised 3D Semantic Segmentation. Unsupervised Gaze Representation Learning by Switching Features. H2OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers. MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection. Parse Trees Guided LLM Prompt Compression.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1