DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection

IEEE Transactions on Neural Networks and Learning Systems (IF 8.9; Zone 1, Computer Science; JCR Q1, Computer Science, Artificial Intelligence) Pub Date: 2024-12-18 DOI: 10.1109/TNNLS.2024.3516033

Henghao Zhao;Kevin Qinghong Lin;Rui Yan;Zechao Li

Volume 36, issue 8, pages 14522-14535. Source: https://ieeexplore.ieee.org/document/10806586/


Abstract

Video moment retrieval and highlight detection have attracted growing attention in the current era of video content proliferation. Both tasks aim to localize moments and estimate clip relevance given a user's query. Most existing methods approach them from a discriminative learning perspective, learning the correspondence between the query and activity boundary locations through complex cross-modal interactions. However, the continuous nature of video content often leaves the boundaries between temporal events unclear. This boundary ambiguity can confuse models and lead to subpar performance in predicting target boundaries. To alleviate this problem, we propose to solve the two tasks jointly from the perspective of denoising generation, so that the target boundary can be localized clearly through iterative coarse-to-fine refinement. Specifically, we propose a novel framework, DiffusionVMR, which uses a diffusion model to recast the two tasks as a unified conditional denoising generation process. During training, Gaussian noise is added to corrupt the ground truth (GT), and the resulting noisy candidates serve as input; the model is trained to reverse this noising process. At inference, DiffusionVMR starts directly from Gaussian noise and progressively refines the proposals into meaningful output. Notably, DiffusionVMR inherits the ability of diffusion models to refine results iteratively during inference, sharpening boundaries from coarse to fine. Furthermore, training and inference in DiffusionVMR are decoupled: the settings used at inference need not match those used during training. Extensive experiments on five widely used benchmarks (QVHighlights, Charades-STA, TACoS, YouTubeHighlights, and TVSum) across two tasks (moment retrieval and/or highlight detection) demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
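The noising and refinement procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The cosine noise schedule, the encoding of a moment as a normalized (center, width) pair, the deterministic DDIM-style update, and every function name below are assumptions chosen for illustration. In particular, `toy_denoiser` is a hypothetical stand-in that returns a fixed span, where a real model would be a learned network conditioned on the video and the query.

```python
import numpy as np

def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal rate alpha_bar[t], t = 0..num_steps (assumed cosine schedule)."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]  # alpha_bar[0] == 1 (no noise), alpha_bar[-1] ~ 0 (pure noise)

def corrupt_spans(gt_spans, t, alpha_bar, rng):
    """Training side: add Gaussian noise to ground-truth (center, width) spans at step t."""
    noise = rng.standard_normal(gt_spans.shape)
    noisy = np.sqrt(alpha_bar[t]) * gt_spans + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise

def refine_spans(denoiser, num_proposals, alpha_bar, num_infer_steps, rng):
    """Inference side: start from Gaussian noise and refine proposals step by step."""
    total = len(alpha_bar) - 1
    # The number of inference steps is picked here, independently of training.
    steps = np.linspace(total, 0, num_infer_steps + 1).round().astype(int)
    x = rng.standard_normal((num_proposals, 2))
    for t_now, t_next in zip(steps[:-1], steps[1:]):
        x0_pred = denoiser(x, t_now)  # the model's current guess of the clean spans
        # Noise implied by the current sample and the guess (clamped for safety).
        denom = np.sqrt(max(1.0 - alpha_bar[t_now], 1e-12))
        eps = (x - np.sqrt(alpha_bar[t_now]) * x0_pred) / denom
        # Deterministic (DDIM-style) move to the less noisy step t_next.
        x = np.sqrt(alpha_bar[t_next]) * x0_pred + np.sqrt(1.0 - alpha_bar[t_next]) * eps
    return x

def to_start_end(spans):
    """Convert normalized (center, width) spans to (start, end), clipped to [0, 1]."""
    center, width = spans[:, 0], spans[:, 1]
    return np.clip(np.stack([center - width / 2, center + width / 2], axis=1), 0.0, 1.0)

def toy_denoiser(x, t):
    """Hypothetical stand-in for a learned, query-conditioned network."""
    target = np.array([0.4, 0.2])  # made-up moment: center 0.4, width 0.2
    return np.broadcast_to(target, x.shape).copy()

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)
noisy, _ = corrupt_spans(np.array([[0.4, 0.2]]), 500, alpha_bar, rng)
proposals = refine_spans(toy_denoiser, num_proposals=5, alpha_bar=alpha_bar,
                         num_infer_steps=4, rng=rng)
print(to_start_end(proposals))
```

Because the schedule is indexed by timestep rather than tied to a fixed loop length, `num_infer_steps` can be changed at inference without retraining. That is one plausible reading of the decoupling the abstract describes; the abstract itself does not say which settings the authors vary.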


Source journal

IEEE Transactions on Neural Networks and Learning Systems (Computer Science, Artificial Intelligence; Computer Science, Hardware & Architecture)

CiteScore: 23.80
Self-citation rate: 9.60%
Articles published: 2102
Review time: 3-8 weeks

Journal description: IEEE Transactions on Neural Networks and Learning Systems publishes scholarly articles on the theory, design, and applications of neural networks and other learning systems. The journal primarily highlights technical and scientific research in this domain.