DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection

IEEE Transactions on Neural Networks and Learning Systems · Impact Factor 8.9 · JCR Q1 (Computer Science, Artificial Intelligence) · CAS Category 1 (Computer Science) · Published: 2024-12-18 · DOI: 10.1109/TNNLS.2024.3516033
Henghao Zhao;Kevin Qinghong Lin;Rui Yan;Zechao Li
{"title":"DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection","authors":"Henghao Zhao;Kevin Qinghong Lin;Rui Yan;Zechao Li","doi":"10.1109/TNNLS.2024.3516033","DOIUrl":null,"url":null,"abstract":"Video moment retrieval and highlight detection have received attention in the current era of video content proliferation, aiming to localize moments and estimate clip relevances based on user-specific queries. Most existing methods approach these challenges from a discriminative learning perspective, focusing on learning the correspondence between query and activity boundary locations through complex cross-modal interactions. However, the continuous nature of video content often results in unclear boundaries between temporal events. This boundary ambiguity may confuse models, resulting in the subpar performance in predicting target boundaries. To alleviate this problem, we propose to solve the two tasks jointly from the perspective of denoising generation. Moreover, the target boundary can be localized clearly by iterative refinement from coarse to fine. Specifically, a novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process by combining the diffusion model. During training, the Gaussian noise is added to corrupt the ground truth (GT), with noisy candidates produced as input. The model is trained to reverse this noise addition process. In the inference phase, DiffusionVMR initiates directly from Gaussian noise and progressively refines the proposals from the noise to the meaningful output. Notably, the proposed DiffusionVMR inherits the advantages of diffusion models that allow for iteratively refined results during inference, enhancing the boundary transition from coarse to fine. Furthermore, the training and inference of DiffusionVMR are decoupled. An arbitrary setting can be used in DiffusionVMR during inference without consistency with the training phase. Extensive experiments conducted on five widely used benchmarks (i.e., QVHighlight, Charades-STA, TACoS, YouTubeHighlights, and TVSum) across two tasks (moment retrieval and/or highlight detection) demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 8","pages":"14522-14535"},"PeriodicalIF":8.9000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10806586/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Video moment retrieval and highlight detection have received attention in the current era of video content proliferation, aiming to localize moments and estimate clip relevance based on user-specific queries. Most existing methods approach these challenges from a discriminative learning perspective, focusing on learning the correspondence between the query and activity boundary locations through complex cross-modal interactions. However, the continuous nature of video content often results in unclear boundaries between temporal events. This boundary ambiguity may confuse models, resulting in subpar performance in predicting target boundaries. To alleviate this problem, we propose to solve the two tasks jointly from the perspective of denoising generation, so that the target boundary can be localized clearly by iterative refinement from coarse to fine. Specifically, a novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process by incorporating the diffusion model. During training, Gaussian noise is added to corrupt the ground truth (GT), producing noisy candidates as input, and the model is trained to reverse this noise-addition process. In the inference phase, DiffusionVMR starts directly from Gaussian noise and progressively refines the proposals from noise into meaningful output. Notably, DiffusionVMR inherits the advantage of diffusion models that results can be iteratively refined during inference, enhancing the boundary transition from coarse to fine. Furthermore, the training and inference of DiffusionVMR are decoupled: an arbitrary inference setting can be used without requiring consistency with the training phase. Extensive experiments conducted on five widely used benchmarks (i.e., QVHighlight, Charades-STA, TACoS, YouTubeHighlights, and TVSum) across two tasks (moment retrieval and/or highlight detection) demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
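
To make the training/inference recipe described above concrete, the sketch below illustrates a generic diffusion forward process (corrupting ground-truth moment spans with Gaussian noise) and a DDIM-style reverse process (refining proposals from pure noise). Everything here is an assumption introduced for illustration: the (center, width) span parameterization, the cosine noise schedule, the sampler, and the placeholder `denoiser` network are not taken from the paper, whose actual architecture and hyperparameters are defined in the published work.

```python
import torch

T = 1000  # number of diffusion steps assumed for training

def alpha_bar(t, s=0.008):
    """Cumulative signal level under an assumed cosine noise schedule."""
    return torch.cos((t / T + s) / (1 + s) * torch.pi / 2) ** 2

def corrupt_spans(gt_spans, t):
    """Training-side forward process: corrupt ground-truth spans with Gaussian
    noise to produce noisy candidates. gt_spans: (N, 2) normalized (center, width)."""
    ab = alpha_bar(t.float()).view(-1, 1)
    noise = torch.randn_like(gt_spans)
    x_t = ab.sqrt() * gt_spans + (1.0 - ab).sqrt() * noise
    return x_t, noise  # the model is trained to reverse this corruption

def denoiser(x_t, t, video_feats, query_feats):
    """Hypothetical stand-in for the conditional network that predicts clean
    spans from noisy candidates, given video and query features."""
    return x_t.clamp(0.0, 1.0)

@torch.no_grad()
def sample_spans(video_feats, query_feats, num_proposals=10, steps=4):
    """Inference: start from pure Gaussian noise and refine proposals
    coarse-to-fine. The step count here need not match the training schedule."""
    x = torch.randn(num_proposals, 2)
    ts = torch.linspace(T - 1, 0, steps + 1).long()
    x0_hat = x
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        ab_cur, ab_next = alpha_bar(t_cur.float()), alpha_bar(t_next.float())
        x0_hat = denoiser(x, t_cur, video_feats, query_feats)       # predict clean spans
        eps = (x - ab_cur.sqrt() * x0_hat) / (1.0 - ab_cur).sqrt()  # implied noise
        x = ab_next.sqrt() * x0_hat + (1.0 - ab_next).sqrt() * eps  # DDIM-style update
    return x0_hat.clamp(0.0, 1.0)

# Example usage with dummy conditioning features (hypothetical):
# spans = sample_spans(video_feats=None, query_feats=None, num_proposals=5, steps=4)
```

Because the sampler only needs a predicted clean estimate at each step, the number of refinement steps at inference can be chosen freely, which is the decoupling of training and inference that the abstract highlights.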
Source Journal
IEEE Transactions on Neural Networks and Learning Systems
Subject categories: Computer Science, Artificial Intelligence; Computer Science, Hardware & Architecture
CiteScore: 23.80
Self-citation rate: 9.60%
Articles per year: 2102
Review time: 3-8 weeks
Journal scope: The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.