{"title":"PDPP: Projected Diffusion for Procedure Planning in Instructional Videos","authors":"Hanlin Wang;Yilu Wu;Sheng Guo;Limin Wang","doi":"10.1109/TPAMI.2024.3518762","DOIUrl":null,"url":null,"abstract":"In this paper, we study the problem of procedure planning in instructional videos, which aims to make a plan (i.e. a sequence of actions) given the current visual observation and the desired goal. Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision to make autoregressive planning, resulting in complex learning schemes and expensive annotation costs. To avoid intermediate supervision annotation and error accumulation caused by planning autoregressively, we propose a diffusion-based framework, coined as PDPP (Projected Diffusion model for Procedure Planning), to directly model the whole action sequence distribution with task label as supervision instead. Our core idea is to treat procedure planning as a distribution fitting problem under the given observations, thus transform the planning problem to a sampling process from this distribution during inference. The diffusion-based modeling approach also effectively addresses the uncertainty issue in procedure planning. Based on PDPP, we further apply joint training to our framework to generate plans with varying horizon lengths using a single model and reduce the number of training parameters required. We instantiate our PDPP with three popular diffusion models and investigate a serious of condition-introducing methods in our framework, including condition embeddings, Mixture-of-Experts (MoEs), two-stage prediction and Classifier-Free Guidance strategy. Finally, we apply our PDPP to the Visual Planners for human Assistance (VPA) problem which requires the goal specified in natural language rather than visual observation. We conduct experiments on challenging datasets of different scales and our PDPP model achieves the state-of-the-art performance on multiple metrics, even compared with those strongly-supervised counterparts. These results further demonstratethe effectiveness and generalization ability of our model.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2107-2124"},"PeriodicalIF":18.6000,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10804102/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, we study the problem of procedure planning in instructional videos, which aims to make a plan (i.e., a sequence of actions) given the current visual observation and the desired goal. Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision to plan autoregressively, resulting in complex learning schemes and expensive annotation costs. To avoid intermediate supervision annotation and the error accumulation caused by autoregressive planning, we propose a diffusion-based framework, named PDPP (Projected Diffusion model for Procedure Planning), which directly models the distribution of the whole action sequence using only the task label as supervision. Our core idea is to treat procedure planning as a distribution fitting problem under the given observations, thus transforming planning at inference time into sampling from this distribution. The diffusion-based modeling approach also effectively addresses the uncertainty inherent in procedure planning. Building on PDPP, we further apply joint training to our framework to generate plans with varying horizon lengths using a single model, reducing the number of training parameters required. We instantiate PDPP with three popular diffusion models and investigate a series of condition-introducing methods in our framework, including condition embeddings, Mixture-of-Experts (MoEs), two-stage prediction, and a Classifier-Free Guidance strategy. Finally, we apply PDPP to the Visual Planners for human Assistance (VPA) problem, in which the goal is specified in natural language rather than as a visual observation. We conduct experiments on challenging datasets of different scales, and our PDPP model achieves state-of-the-art performance on multiple metrics, even when compared with strongly supervised counterparts. These results further demonstrate the effectiveness and generalization ability of our model.
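To make the high-level idea concrete, the following is a minimal, illustrative sketch (in PyTorch) of what "sampling a plan from a conditional diffusion model with classifier-free guidance" can look like; classifier-free guidance is one of the conditioning strategies the abstract names. Everything here is an assumption for illustration: the names PlanDenoiser and sample_plan, the tiny MLP denoiser, the linear noise schedule, and the dimensions are all invented, and the projection step that gives PDPP its name is omitted. This is not the authors' implementation.

```python
import torch
import torch.nn as nn

class PlanDenoiser(nn.Module):
    """Toy denoiser (hypothetical, not PDPP's network): predicts the noise
    added to an action-sequence plan, conditioned on start-observation
    features, a goal/task-label embedding, and the diffusion step."""
    def __init__(self, horizon, action_dim, cond_dim, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + 2 * cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, x, t, obs, goal):
        b = x.shape[0]
        inp = torch.cat(
            [x.reshape(b, -1), obs, goal, t.float().unsqueeze(-1)], dim=-1
        )
        return self.net(inp).reshape(b, self.horizon, self.action_dim)

@torch.no_grad()
def sample_plan(model, obs, goal, horizon, action_dim, steps=50, guidance=2.0):
    """DDPM-style ancestral sampling with classifier-free guidance:
    start from Gaussian noise and iteratively denoise toward a plan."""
    betas = torch.linspace(1e-4, 0.02, steps)        # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    b = obs.shape[0]
    x = torch.randn(b, horizon, action_dim)          # pure-noise initial plan
    null_obs, null_goal = torch.zeros_like(obs), torch.zeros_like(goal)
    for t in reversed(range(steps)):
        tt = torch.full((b,), t)
        eps_cond = model(x, tt, obs, goal)           # conditional prediction
        eps_uncond = model(x, tt, null_obs, null_goal)  # unconditional
        # classifier-free guidance: extrapolate toward the conditional branch
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) \
               / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    # read the plan out as one discrete action index per horizon step
    return x.argmax(dim=-1)
```

A usage sketch with an untrained model (so the output is random, but the shapes and the sampling loop are exercised):

```python
model = PlanDenoiser(horizon=4, action_dim=10, cond_dim=32)
obs = torch.randn(2, 32)   # placeholder start-observation features
goal = torch.randn(2, 32)  # placeholder task-label / goal embedding
plan = sample_plan(model, obs, goal, horizon=4, action_dim=10)
print(plan.shape)          # torch.Size([2, 4]): a 4-step plan per example
```

Note how this matches the abstract's framing: the whole action sequence is denoised jointly in one sampling process, rather than being predicted one step at a time, which is what avoids autoregressive error accumulation.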