Aggregated Spatio-temporal MLP-Mixer for Violence Recognition in Video Clips

Yuepeng Shen, Jenhui Chen
Published in: 2023 Sixth International Symposium on Computer, Consumer and Control (IS3C)
Publication date: 2023-06-01
DOI: 10.1109/IS3C57901.2023.00020
URL: https://doi.org/10.1109/IS3C57901.2023.00020

Abstract

Existing violent-behavior datasets are limited in both quantity and quality because such footage is difficult to collect. Although state-of-the-art Transformer models have shown their capability in behavior recognition, they are ill-suited to short-term behavior understanding (e.g., violent behavior recognition) because they need a large amount of data to reach their best performance. Recently, MLP-Mixer, a simple deep learning architecture built entirely from multilayer perceptrons (MLPs), was proposed as an alternative to the Transformer and obtains competitive results on few-sample datasets. Motivated by the spatio-temporal features of neurons, we devise a dual-form dataset for training an MLP-Mixer-based model, called the aggregated spatio-temporal MLP-Mixer (ASM), to handle video understanding tasks. We show that ASM outperforms state-of-the-art Transformer models as well as some of the best-performing convolutional neural network (CNN) approaches on three public datasets: the smart-city CCTV violence detection (SCVD) dataset, the real-life violence situations (RLVS) dataset, and the Hockey Fight dataset. The experimental results further support our approach to improving short-term behavior scene understanding.
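
For readers unfamiliar with the MLP-Mixer building block mentioned in the abstract, the following is a minimal pure-Python sketch of one generic Mixer block: a token-mixing MLP applied across tokens, followed by a channel-mixing MLP applied across channels, each with a skip connection. It is not the ASM model, whose details the abstract does not give. All function names, the toy sizes, the random weights, and the omission of biases and learnable normalization parameters are assumptions made for illustration.

```python
import math
import random


def gelu(x):
    # Exact GELU activation via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), with matrices as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def layer_norm(x, eps=1e-6):
    # Normalize each token (row) over its channels; no learnable scale/shift here.
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def mlp(x, w1, w2):
    # Two linear layers with a GELU in between (biases omitted for brevity).
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    # x has shape (tokens, channels).
    # Token mixing: transpose so the MLP acts along the token axis.
    y = transpose(mlp(transpose(layer_norm(x)), tok_w1, tok_w2))
    x = add(x, y)  # skip connection
    # Channel mixing: the MLP acts along the channel axis of each token.
    z = mlp(layer_norm(x), ch_w1, ch_w2)
    return add(x, z)  # skip connection


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def demo():
    # Hypothetical toy sizes: 4 tokens (e.g. patches), 6 channels, hidden width 8.
    rng = random.Random(0)
    tokens, channels, hidden = 4, 6, 8
    x = rand_matrix(tokens, channels, rng, 1.0)
    tok_w1 = rand_matrix(tokens, hidden, rng)
    tok_w2 = rand_matrix(hidden, tokens, rng)
    ch_w1 = rand_matrix(channels, hidden, rng)
    ch_w2 = rand_matrix(hidden, channels, rng)
    return x, mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2)


if __name__ == "__main__":
    x, out = demo()
    print(len(out), "tokens x", len(out[0]), "channels")
```

The block preserves the input shape, so several such blocks can be stacked. For video, one plausible choice is to treat patches drawn from one or more frames as the tokens, but how ASM forms and aggregates its spatial and temporal inputs is not stated here.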