Question-Answering Dense Video Events

Hangyu Qin, Junbin Xiao, Angela Yao
{"title":"Question-Answering Dense Video Events","authors":"Hangyu Qin, Junbin Xiao, Angela Yao","doi":"arxiv-2409.04388","DOIUrl":null,"url":null,"abstract":"Multimodal Large Language Models (MLLMs) have shown excellent performance in\nquestion-answering of single-event videos. In this paper, we present\nquestion-answering dense video events, a novel task that requires answering and\ngrounding the dense-event questions in long videos, thus challenging MLLMs to\nfaithfully comprehend and reason about multiple events occurring over extended\ntime periods. To facilitate the study, we construct DeVE-QA - a dataset\nfeaturing 78K questions about 26K events on 10.6K long videos. We then\nbenchmark and show that existing MLLMs excelling at single-event QA struggle to\nperform well in DeVE-QA. For improvement, we propose DeVi, a novel\ntraining-free MLLM approach that highlights a hierarchical captioning module, a\ntemporal event memory module, and a self-consistency checking module to\nrespectively detect, contextualize and memorize, and ground dense-events in\nlong videos for question answering. Extensive experiments show that DeVi is\nsuperior at answering dense-event questions and grounding relevant video\nmoments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1\npercent and 3.7 percent for G(round)QA accuracy on DeVE-QA and NExT-GQA\nrespectively.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"392 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04388","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multimodal Large Language Models (MLLMs) have shown excellent performance in question answering on single-event videos. In this paper, we present question-answering of dense video events, a novel task that requires answering and grounding dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events occurring over extended time periods. To facilitate the study, we construct DeVE-QA, a dataset featuring 78K questions about 26K events in 10.6K long videos. We then benchmark existing MLLMs and show that models excelling at single-event QA struggle to perform well on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach built on a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module, which respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding the relevant video moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1 percent and 3.7 percent in G(round)QA accuracy on DeVE-QA and NExT-GQA respectively.
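
The abstract names DeVi's three modules but gives no implementation detail. The following is a minimal, hypothetical Python sketch of how such a training-free pipeline could be wired together; the Event structure, the caption-merging heuristic, and the caption_fn and score_fn interfaces are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a three-module, training-free pipeline in the spirit of
# DeVi as described in the abstract. Names, data structures, and the stub
# caption/scoring callables are illustrative assumptions, not the authors' code.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Event:
    """A detected video event with its caption and time span (in frame indices)."""
    caption: str
    start: float
    end: float


def hierarchical_captioning(frames: List[str],
                            caption_fn: Callable[[List[str]], str],
                            clip_len: int = 16) -> List[Event]:
    """Detect events by captioning short clips, then merging adjacent clips
    with identical captions (a coarse stand-in for hierarchical merging)."""
    events: List[Event] = []
    for i in range(0, len(frames), clip_len):
        clip = frames[i:i + clip_len]
        events.append(Event(caption_fn(clip), start=float(i), end=float(i + len(clip))))
    merged: List[Event] = []
    for ev in events:
        if merged and merged[-1].caption == ev.caption:
            merged[-1].end = ev.end  # extend the previous event instead of starting a new one
        else:
            merged.append(ev)
    return merged


def temporal_event_memory(events: List[Event]) -> str:
    """Contextualize and memorize dense events as a time-ordered textual memory
    that could be passed to an MLLM together with the question."""
    return "\n".join(f"[{ev.start:.0f}-{ev.end:.0f}] {ev.caption}" for ev in events)


def self_consistency_grounding(question: str, answer: str, events: List[Event],
                               score_fn: Callable[[str, str], float]) -> Tuple[float, float]:
    """Ground the answer by selecting the event whose caption is most consistent
    with the question-answer pair (assumed scoring interface)."""
    best = max(events, key=lambda ev: score_fn(f"{question} {answer}", ev.caption))
    return best.start, best.end


if __name__ == "__main__":
    # Toy usage with stub callables standing in for an MLLM captioner and scorer.
    frames = [f"frame_{i}" for i in range(48)]
    events = hierarchical_captioning(
        frames,
        caption_fn=lambda clip: "a person cooks" if int(clip[0].split("_")[1]) < 32 else "a person eats",
    )
    print(temporal_event_memory(events))
    span = self_consistency_grounding(
        "What happens after cooking?", "the person eats", events,
        score_fn=lambda qa, cap: float(len(set(qa.split()) & set(cap.split()))),
    )
    print("Grounded span:", span)
```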