Keyframe-guided Video Swin Transformer with Multi-path Excitation for Violence Detection

IF 1.5 | Zone 4 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Computer Journal | Pub Date: 2023-10-20 | DOI: 10.1093/comjnl/bxad103
Chenghao Li, Xinyan Yang, Gang Liang

Abstract

Violence detection is a critical task that aims to identify violent behavior in video by extracting frames and applying classification models. However, the complexity of video data and the suddenness of violent events make it hard to pinpoint instances of violence accurately, so extracting the frames that indicate violence is challenging. Furthermore, designing and applying high-performance models for violence detection remains an open problem. Traditional models embed spatial features extracted from sampled frames directly into a temporal sequence, which ignores the spatio-temporal characteristics of video and limits the ability to express continuous changes between adjacent frames. To address these challenges, this paper proposes a novel framework called ACTION-VST. First, a keyframe extraction algorithm is developed to select the frames most likely to represent violent scenes in a video. Second, to transform visual sequences into spatio-temporal feature maps, a multi-path excitation module is proposed to activate spatio-temporal, channel and motion features. Third, an advanced Video Swin Transformer-based network is employed for both global and local spatio-temporal modeling, which enables comprehensive feature extraction and representation of violence. The proposed method was validated on two large-scale datasets, RLVS and RWF-2000, achieving accuracies of over 98% and 93%, respectively, surpassing the state of the art.
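The abstract does not say how keyframes are chosen, so the following is only an illustrative sketch of one common heuristic and not the ACTION-VST algorithm: score each frame by its mean absolute difference from the previous frame and keep the `k` highest-scoring frames. The function name `select_keyframes`, the parameter `k`, and the scoring rule are all assumptions made for this example.

```python
import numpy as np


def select_keyframes(frames: np.ndarray, k: int) -> list[int]:
    """Return indices of the k frames that differ most from their predecessor.

    frames: array of shape (T, H, W) or (T, H, W, C).
    Hypothetical heuristic for illustration; not the paper's method.
    """
    t = frames.shape[0]
    if t < 2:
        return list(range(min(k, t)))
    # Cast to float so uint8 subtraction cannot wrap around.
    x = frames.astype(np.float32)
    # Mean absolute difference between each frame and the one before it.
    diffs = np.abs(x[1:] - x[:-1]).reshape(t - 1, -1).mean(axis=1)
    # Frame 0 has no predecessor, so it gets a score of 0.
    scores = np.concatenate([[0.0], diffs])
    k = min(k, t)
    # Stable sort on negated scores: ties are resolved in favour of earlier frames.
    top = np.argsort(-scores, kind="stable")[:k]
    # Return the chosen indices in temporal order.
    return sorted(int(i) for i in top)
```

On a synthetic clip where a single frame changes abruptly, this heuristic picks the frame where the change appears and the frame where it disappears. A real pipeline would presumably need a score that is more specific to violent motion than raw pixel differences.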