Surveillance Video-and-Language Understanding: From Small to Large Multimodal Models

IEEE Transactions on Circuits and Systems for Video Technology · Impact Factor 11.1 · JCR Q1 (Engineering, Electrical & Electronic) · Vol. 35, No. 1, pp. 300–314 · Published: 2024-09-17 · DOI: 10.1109/TCSVT.2024.3462433
Tongtong Yuan;Xuange Zhang;Bo Liu;Kun Liu;Jian Jin;Zhenzhen Jiao

Abstract

Surveillance videos play a crucial role in public security. However, current tasks related to surveillance videos primarily focus on classifying and localizing anomalous events. Despite achieving notable performance, existing methods are restricted to detecting and classifying predefined events and lack satisfactory semantic understanding. To tackle this challenge, we introduce a novel research avenue focused on Video-and-Language Understanding for surveillance (VALU), and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains 23,542 sentences, with an average length of 20 words, and its annotated videos total 110.7 hours. Moreover, we evaluate SOTA models on five multimodal tasks using this newly created dataset, establishing new baselines for surveillance VALU, from small to large models. Our experiments reveal that mainstream models, which perform well on previously public datasets, exhibit poor performance on surveillance videos, highlighting new challenges in surveillance VALU. In addition to conducting baseline experiments to compare the performance of existing models, we also propose novel methods for multimodal anomaly detection tasks and fine-tune multimodal large language models using our dataset. All the experiments highlight the necessity of constructing this multimodal dataset to advance surveillance AI. Building on these experimental results, we provide further in-depth analysis and discussion. The dataset and code are provided at https://xuange923.github.io/Surveillance-Video-Understanding.
Source journal metrics: CiteScore 13.80 · Self-citation rate 27.40% · Annual articles 660 · Review time 5 months
Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.