SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses

Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu
{"title":"SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses","authors":"Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu","doi":"arxiv-2408.01669","DOIUrl":null,"url":null,"abstract":"Video grounding is a fundamental problem in multimodal content understanding,\naiming to localize specific natural language queries in an untrimmed video.\nHowever, current video grounding datasets merely focus on simple events and are\neither limited to shorter videos or brief sentences, which hinders the model\nfrom evolving toward stronger multimodal understanding capabilities. To address\nthese limitations, we present a large-scale video grounding dataset named\nSynopGround, in which more than 2800 hours of videos are sourced from popular\nTV dramas and are paired with accurately localized human-written synopses. Each\nparagraph in the synopsis serves as a language query and is manually annotated\nwith precise temporal boundaries in the long video. These paragraph queries are\ntightly correlated to each other and contain a wealth of abstract expressions\nsummarizing video storylines and specific descriptions portraying event\ndetails, which enables the model to learn multimodal perception on more\nintricate concepts over longer context dependencies. Based on the dataset, we\nfurther introduce a more complex setting of video grounding dubbed\nMulti-Paragraph Video Grounding (MPVG), which takes as input multiple\nparagraphs and a long video for grounding each paragraph query to its temporal\ninterval. In addition, we propose a novel Local-Global Multimodal Reasoner\n(LGMR) to explicitly model the local-global structures of long-term multimodal\ninputs for MPVG. Our method provides an effective baseline solution to the\nmulti-paragraph video grounding problem. Extensive experiments verify the\nproposed model's effectiveness as well as its superiority in long-term\nmulti-paragraph video grounding over prior state-of-the-arts. Dataset and code\nare publicly available. Project page: https://synopground.github.io/.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"93 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.01669","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Video grounding is a fundamental problem in multimodal content understanding, aiming to localize a natural language query in an untrimmed video. However, current video grounding datasets mostly focus on simple events and are limited to either short videos or brief sentences, which hinders models from developing stronger multimodal understanding capabilities. To address these limitations, we present SynopGround, a large-scale video grounding dataset in which more than 2,800 hours of video are sourced from popular TV dramas and paired with accurately localized, human-written synopses. Each paragraph in a synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated with each other and contain a wealth of abstract expressions summarizing video storylines as well as specific descriptions portraying event details, enabling models to learn multimodal perception of more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex video grounding setting dubbed Multi-Paragraph Video Grounding (MPVG), which takes multiple paragraphs and a long video as input and grounds each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) that explicitly models the local-global structure of long-term multimodal inputs for MPVG. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the proposed model's effectiveness and its superiority over prior state-of-the-art methods in long-term multi-paragraph video grounding. The dataset and code are publicly available. Project page: https://synopground.github.io/.
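To make the MPVG setting concrete, below is a minimal sketch of the task interface in Python. The abstract specifies only the inputs (several synopsis paragraphs plus one long video) and the outputs (a temporal interval per paragraph); the data-structure names and the temporal-IoU recall metric shown here are illustrative assumptions based on common video-grounding practice, not the authors' released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types illustrating the MPVG task described in the abstract:
# one long video is paired with several ordered synopsis paragraphs, and the
# model must ground each paragraph to a temporal interval (in seconds).

@dataclass
class MPVGExample:
    video_id: str                         # e.g. one episode of a TV drama
    duration: float                       # total video length in seconds
    paragraphs: List[str]                 # synopsis paragraphs, in story order
    intervals: List[Tuple[float, float]]  # annotated (start, end) per paragraph

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Temporal IoU between two intervals, the usual grounding metric (assumed)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate(example: MPVGExample,
             predictions: List[Tuple[float, float]],
             threshold: float = 0.5) -> float:
    """Fraction of paragraph queries grounded with IoU >= threshold (Recall@IoU)."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(predictions, example.intervals))
    return hits / len(example.intervals)
```

Grounding quality in this kind of setting is typically reported as Recall@IoU at several thresholds (e.g. 0.3, 0.5, 0.7); the sketch above assumes that convention.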