Suspicious activities detection using spatial–temporal features based on vision transformer and recurrent neural network

Zone 3, Computer Science · JCR Q1, Computer Science · Journal of Ambient Intelligence and Humanized Computing · Pub Date: 2024-05-29 · DOI: 10.1007/s12652-024-04818-7
Saba Hameed, Javaria Amin, Muhammad Almas Anjum, Muhammad Sharif
Citations: 0

Abstract

Demand for surveillance applications is growing because of the need for safety and security against anomalous events. An anomaly in a video is an event that exhibits unusual behavior. Recognizing such events takes time; computerized methods may reduce that time and make prediction more efficient. Accurate anomaly detection nevertheless remains a challenge because of complex backgrounds, illumination, variations, and occlusion. To address these challenges, a vision transformer convolutional recurrent neural network, named the ViT-CNN-RCNN model, is proposed for classifying suspicious activities from frames and videos. The pre-trained ViT-base-patch16-224-in21k model takes 224 × 224 × 3 video frames as input and splits them into patches of size 16 × 16. It consists of a patch embedding layer, a ViT encoder, a ViT transformer layer with 11 blocks, layer normalization, and a ViT pooler. The ViT model is trained with selected learning parameters, such as 20 training epochs and a batch size of 10, to categorize the input frames into thirteen classes, such as robbery, fighting, shooting, stealing, shoplifting, arrest, arson, abuse, exploiting, road accident, burglary, and vandalism. The CNN-RNN sequential model, designed to process sequential data, contains an input layer, a GRU layer, a GRU-1 layer, and a dense layer. It is trained with optimal hyperparameters, such as a video frame size of 32, 30 training epochs, and a batch size of 16, to classify the input into the corresponding class labels. The proposed model is evaluated on the UNI-crime and UCF-crime datasets. The experimental results indicate that the proposed approach outperforms recently published works.
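
The patch arithmetic behind the figures in the abstract can be checked with a short sketch. This is only an illustration of how a 224 × 224 × 3 frame is cut into non-overlapping 16 × 16 patches, not the authors' implementation; the function names `patch_grid` and `split_into_patches` are hypothetical, and the all-zero frame is a stand-in for a real video frame.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # one RGB pixel
Image = List[List[Pixel]]      # rows of pixels, shape H x W x 3


def patch_grid(height: int, width: int, channels: int, patch: int) -> Tuple[int, int]:
    """Return (number of patches, values per flattened patch)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (height // patch) * (width // patch)
    patch_dim = patch * patch * channels
    return num_patches, patch_dim


def split_into_patches(image: Image, patch: int) -> List[List[int]]:
    """Cut an image into non-overlapping square patches (row-major order)
    and flatten each patch into a 1-D list of channel values."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches: List[List[int]] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat: List[int] = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])
            patches.append(flat)
    return patches


# Sizes quoted in the abstract: 224 x 224 x 3 frames, 16 x 16 patches.
num_patches, patch_dim = patch_grid(224, 224, 3, 16)
print(num_patches, patch_dim)  # 196 768

# Hypothetical all-zero frame standing in for a real video frame.
frame: Image = [[(0, 0, 0) for _ in range(224)] for _ in range(224)]
patches = split_into_patches(frame, 16)
print(len(patches), len(patches[0]))  # 196 768
```

Each frame therefore yields a 14 × 14 grid of 196 patches, each holding 768 raw values. In the standard ViT design a learned classification token is prepended to this sequence, giving 197 tokens; the abstract does not spell out that detail, so treat it as background on ViT rather than a statement about the paper.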

Source journal
Journal of Ambient Intelligence and Humanized Computing
Categories: Computer Science, Artificial Intelligence; Computer Science, Information Systems
CiteScore: 9.60
Self-citation rate: 0.00%
Articles published: 854
Journal description: The purpose of JAIHC is to provide a high-profile, leading-edge forum for academics, industrial professionals, educators, and policy makers in the field to contribute and disseminate the most innovative research and developments in all aspects of ambient intelligence and humanized computing, such as intelligent/smart objects, environments/spaces, and systems. The journal discusses various technical, safety, personal, social, physical, political, artistic, and economic issues. The research topics covered by the journal include (but are not limited to): Pervasive/Ubiquitous Computing and Applications; Cognitive wireless sensor network; Embedded Systems and Software; Mobile Computing and Wireless Communications; Next Generation Multimedia Systems; Security, Privacy and Trust; Service and Semantic Computing; Advanced Networking Architectures; Dependable, Reliable and Autonomic Computing; Embedded Smart Agents; Context awareness, social sensing and inference; Multi modal interaction design; Ergonomics and product prototyping; Intelligent and self-organizing transportation networks & services; Healthcare Systems; Virtual Humans & Virtual Worlds; Wearable sensors and actuators.