{"title":"DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection","authors":"Henghao Zhao;Kevin Qinghong Lin;Rui Yan;Zechao Li","doi":"10.1109/TNNLS.2024.3516033","DOIUrl":null,"url":null,"abstract":"Video moment retrieval and highlight detection have received attention in the current era of video content proliferation, aiming to localize moments and estimate clip relevances based on user-specific queries. Most existing methods approach these challenges from a discriminative learning perspective, focusing on learning the correspondence between query and activity boundary locations through complex cross-modal interactions. However, the continuous nature of video content often results in unclear boundaries between temporal events. This boundary ambiguity may confuse models, resulting in the subpar performance in predicting target boundaries. To alleviate this problem, we propose to solve the two tasks jointly from the perspective of denoising generation. Moreover, the target boundary can be localized clearly by iterative refinement from coarse to fine. Specifically, a novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process by combining the diffusion model. During training, the Gaussian noise is added to corrupt the ground truth (GT), with noisy candidates produced as input. The model is trained to reverse this noise addition process. In the inference phase, DiffusionVMR initiates directly from Gaussian noise and progressively refines the proposals from the noise to the meaningful output. Notably, the proposed DiffusionVMR inherits the advantages of diffusion models that allow for iteratively refined results during inference, enhancing the boundary transition from coarse to fine. Furthermore, the training and inference of DiffusionVMR are decoupled. An arbitrary setting can be used in DiffusionVMR during inference without consistency with the training phase. Extensive experiments conducted on five widely used benchmarks (i.e., QVHighlight, Charades-STA, TACoS, YouTubeHighlights, and TVSum) across two tasks (moment retrieval and/or highlight detection) demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 8","pages":"14522-14535"},"PeriodicalIF":8.9000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10806586/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Video moment retrieval and highlight detection have received attention in the current era of video content proliferation, aiming to localize moments and estimate clip relevances based on user-specific queries. Most existing methods approach these challenges from a discriminative learning perspective, focusing on learning the correspondence between query and activity boundary locations through complex cross-modal interactions. However, the continuous nature of video content often results in unclear boundaries between temporal events. This boundary ambiguity may confuse models, resulting in the subpar performance in predicting target boundaries. To alleviate this problem, we propose to solve the two tasks jointly from the perspective of denoising generation. Moreover, the target boundary can be localized clearly by iterative refinement from coarse to fine. Specifically, a novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process by combining the diffusion model. During training, the Gaussian noise is added to corrupt the ground truth (GT), with noisy candidates produced as input. The model is trained to reverse this noise addition process. In the inference phase, DiffusionVMR initiates directly from Gaussian noise and progressively refines the proposals from the noise to the meaningful output. Notably, the proposed DiffusionVMR inherits the advantages of diffusion models that allow for iteratively refined results during inference, enhancing the boundary transition from coarse to fine. Furthermore, the training and inference of DiffusionVMR are decoupled. An arbitrary setting can be used in DiffusionVMR during inference without consistency with the training phase. Extensive experiments conducted on five widely used benchmarks (i.e., QVHighlight, Charades-STA, TACoS, YouTubeHighlights, and TVSum) across two tasks (moment retrieval and/or highlight detection) demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
期刊介绍:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.