Suspicious activities detection using spatial–temporal features based on vision transformer and recurrent neural network

3区计算机科学 Q1 Computer Science Journal of Ambient Intelligence and Humanized Computing Pub Date : 2024-05-29 DOI:10.1007/s12652-024-04818-7

Saba Hameed, Javaria Amin, Muhammad Almas Anjum, Muhammad Sharif

{"title":"Suspicious activities detection using spatial–temporal features based on vision transformer and recurrent neural network","authors":"Saba Hameed, Javaria Amin, Muhammad Almas Anjum, Muhammad Sharif","doi":"10.1007/s12652-024-04818-7","DOIUrl":null,"url":null,"abstract":"<p>Nowadays there is growing demand for surveillance applications due to the safety and security from anomalous events. An anomaly in the video is referred to as an event that has some unusual behavior. Although time is required for the recognition of these anomalous events, computerized methods might help to decrease it and perform efficient prediction. However, accurate anomaly detection is still a challenge due to complex background, illumination, variations, and occlusion. To handle these challenges a method is proposed for a vision transformer convolutional recurrent neural network named ViT-CNN-RCNN model for the classification of suspicious activities based on frames and videos. The proposed pre-trained ViT-base-patch16-224-in21k model contains 224 × 224 × 3 video frames as input and converts into a 16 × 16 patch size. The ViT-base-patch16-224-in21k has a patch embedding layer, ViT encoder, and ViT transformer layer having 11 blocks, layer-norm, and ViT pooler. The ViT model is trained on selected learning parameters such as 20 training epochs, and 10 batch-size to categorize the input frames into thirteen different classes such as robbery, fighting, shooting, stealing, shoplifting, Arrest, Arson, Abuse, exploiting, Road Accident, Burglary, and Vandalism. The CNN-RNN sequential model is designed to process sequential data, that contains an input layer, GRU layer, GRU-1 Layer and Dense Layer. This model is trained on optimal hyperparameters such as 32 video frame sizes, 30 training epochs, and 16 batch-size for classification into corresponding class labels. The proposed model is evaluated on UNI-crime and UCF-crime datasets. The experimental outcomes conclude that the proposed approach better performed as compared to recently published works.</p>","PeriodicalId":14959,"journal":{"name":"Journal of Ambient Intelligence and Humanized Computing","volume":"95 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Ambient Intelligence and Humanized Computing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s12652-024-04818-7","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}

引用次数: 0

Abstract

Nowadays there is growing demand for surveillance applications due to the safety and security from anomalous events. An anomaly in the video is referred to as an event that has some unusual behavior. Although time is required for the recognition of these anomalous events, computerized methods might help to decrease it and perform efficient prediction. However, accurate anomaly detection is still a challenge due to complex background, illumination, variations, and occlusion. To handle these challenges a method is proposed for a vision transformer convolutional recurrent neural network named ViT-CNN-RCNN model for the classification of suspicious activities based on frames and videos. The proposed pre-trained ViT-base-patch16-224-in21k model contains 224 × 224 × 3 video frames as input and converts into a 16 × 16 patch size. The ViT-base-patch16-224-in21k has a patch embedding layer, ViT encoder, and ViT transformer layer having 11 blocks, layer-norm, and ViT pooler. The ViT model is trained on selected learning parameters such as 20 training epochs, and 10 batch-size to categorize the input frames into thirteen different classes such as robbery, fighting, shooting, stealing, shoplifting, Arrest, Arson, Abuse, exploiting, Road Accident, Burglary, and Vandalism. The CNN-RNN sequential model is designed to process sequential data, that contains an input layer, GRU layer, GRU-1 Layer and Dense Layer. This model is trained on optimal hyperparameters such as 32 video frame sizes, 30 training epochs, and 16 batch-size for classification into corresponding class labels. The proposed model is evaluated on UNI-crime and UCF-crime datasets. The experimental outcomes conclude that the proposed approach better performed as compared to recently published works.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

利用基于视觉变换器和递归神经网络的时空特征检测可疑活动

如今，由于异常事件对安全和安保的影响，对监控应用的需求日益增长。视频中的异常是指具有某些异常行为的事件。虽然识别这些异常事件需要时间，但计算机化方法可能有助于减少时间并进行有效预测。然而，由于复杂的背景、光照、变化和遮挡，准确的异常检测仍然是一个挑战。为了应对这些挑战，我们提出了一种名为 ViT-CNN-RCNN 模型的视觉变换卷积递归神经网络方法，用于根据帧和视频对可疑活动进行分类。拟议的预训练 ViT-base-patch16-224-in21k 模型包含 224 × 224 × 3 视频帧作为输入，并转换成 16 × 16 补丁大小。ViT-base-patch16-224-in21k 有一个补丁嵌入层、ViT 编码器、ViT 变换层（有 11 个块）、层规范和 ViT 池器。ViT 模型根据选定的学习参数（如 20 个训练历元和 10 个批量大小）进行训练，将输入帧分为 13 个不同的类别，如抢劫、斗殴、枪击、偷窃、商店行窃、纵火、虐待、剥削、道路事故、入室盗窃和破坏。CNN-RNN 序列模型设计用于处理序列数据，包含输入层、GRU 层、GRU-1 层和密集层。该模型在最佳超参数（如 32 个视频帧大小、30 个训练历元和 16 个批量大小）的基础上进行训练，以将数据分类为相应的类别标签。在 UNI 犯罪数据集和 UCF 犯罪数据集上对所提出的模型进行了评估。实验结果表明，与最近发表的作品相比，所提出的方法性能更好。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Journal of Ambient Intelligence and Humanized Computing COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCEC-COMPUTER SCIENCE, INFORMATION SYSTEMS

CiteScore

9.60

自引率

0.00%

发文量

854

期刊介绍： The purpose of JAIHC is to provide a high profile, leading edge forum for academics, industrial professionals, educators and policy makers involved in the field to contribute, to disseminate the most innovative researches and developments of all aspects of ambient intelligence and humanized computing, such as intelligent/smart objects, environments/spaces, and systems. The journal discusses various technical, safety, personal, social, physical, political, artistic and economic issues. The research topics covered by the journal are (but not limited to): Pervasive/Ubiquitous Computing and Applications Cognitive wireless sensor network Embedded Systems and Software Mobile Computing and Wireless Communications Next Generation Multimedia Systems Security, Privacy and Trust Service and Semantic Computing Advanced Networking Architectures Dependable, Reliable and Autonomic Computing Embedded Smart Agents Context awareness, social sensing and inference Multi modal interaction design Ergonomics and product prototyping Intelligent and self-organizing transportation networks & services Healthcare Systems Virtual Humans & Virtual Worlds Wearables sensors and actuators