Human Violence Detection in Videos Using Key Frame Identification and 3D CNN with Convolutional Block Attention Module

IF 1.8 | Zone 3 (Engineering & Technology) | JCR Q3, ENGINEERING, ELECTRICAL & ELECTRONIC | Circuits, Systems and Signal Processing | Pub Date: 2024-08-13 | DOI: 10.1007/s00034-024-02824-w

Venkatesh Akula, Ilaiah Kavati


Citations: 0

Abstract


In recent years, there has been increasing demand for intelligent automatic surveillance systems that detect abnormal activities in places such as schools, hospitals, prisons, psychiatric centers, and public gatherings. The availability of video surveillance cameras in such places makes it possible to identify violent actions automatically and alert the authorities to minimize loss. Deep learning-based models such as Convolutional Neural Networks (CNNs) have shown better performance in detecting violent activities by exploiting the spatiotemporal features of video frames. In this work, we propose a violence detection model based on a 3D CNN that uses a DenseNet architecture for enhanced spatiotemporal feature capture. First, redundant frames are discarded by identifying the key frames of the video. We use the Multi-Scale Structural Similarity Index Measure (MS-SSIM) to identify the key frames, which contain significant information about the video. Key frame identification helps reduce the complexity of the model. Next, the identified key frames with the lowest MS-SSIM are fed to the 3D CNN to extract spatiotemporal features. Furthermore, we use the Convolutional Block Attention Module (CBAM) to increase the representational capability of the 3D CNN. Results on several benchmark datasets show that the proposed violence detection method performs better than most existing methods. The source code for the proposed method is publicly available at https://github.com/venkateshakula19/violence-detection-using-keyframe-extraction-and-CNN-with-attention-CBAM
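The abstract does not spell out how MS-SSIM scores become a key-frame set, so the following is only a minimal sketch of one plausible reading, not the authors' code (their repository is the reference for that). It assumes grayscale frames given as 2D arrays, scores each frame by its MS-SSIM with the preceding frame, and keeps a fixed budget of the lowest-scoring (most changed) frames in temporal order. The helper names `ms_ssim` and `select_key_frames`, the consecutive-frame comparison and the `num_key_frames` budget are hypothetical. The MS-SSIM itself follows the standard five-scale formulation of Wang, Simoncelli and Bovik (11×11 Gaussian window with σ = 1.5, 2×2 average pooling between scales), with negative contrast-structure terms clamped to zero as many implementations do.

```python
import numpy as np

# Five-scale exponents beta_1..beta_5 from Wang, Simoncelli & Bovik (2003).
MS_SSIM_WEIGHTS = (0.0448, 0.2856, 0.3001, 0.2363, 0.1333)
WIN_SIZE, WIN_SIGMA = 11, 1.5


def _gaussian_kernel(size=WIN_SIZE, sigma=WIN_SIGMA):
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def _blur(img, k):
    """Separable Gaussian filter, 'valid' region only (rows, then columns)."""
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")


def _ssim_and_cs(x, y, data_range):
    """Mean SSIM map and mean contrast-structure map at one scale."""
    k = _gaussian_kernel()
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_x, mu_y = _blur(x, k), _blur(y, k)
    var_x = _blur(x * x, k) - mu_x ** 2
    var_y = _blur(y * y, k) - mu_y ** 2
    cov = _blur(x * y, k) - mu_x * mu_y
    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    cs = (2 * cov + c2) / (var_x + var_y + c2)
    return float((lum * cs).mean()), float(cs.mean())


def _downsample(img):
    """2x2 average pooling: low-pass filter plus decimation by 2."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def ms_ssim(x, y, weights=MS_SSIM_WEIGHTS, data_range=255.0):
    """MS-SSIM of two grayscale frames of equal shape."""
    x, y = np.asarray(x, dtype=np.float64), np.asarray(y, dtype=np.float64)
    if min(x.shape) < WIN_SIZE * 2 ** (len(weights) - 1):
        raise ValueError("frame too small for this many scales")
    score = 1.0
    for j, beta in enumerate(weights):
        ssim_val, cs_val = _ssim_and_cs(x, y, data_range)
        if j == len(weights) - 1:
            # Coarsest scale: luminance, contrast and structure together.
            score *= max(ssim_val, 0.0) ** beta
        else:
            # Finer scales: contrast-structure only; clamp so powers stay real.
            score *= max(cs_val, 0.0) ** beta
            x, y = _downsample(x), _downsample(y)
    return score


def select_key_frames(frames, num_key_frames):
    """Hypothetical rule: keep the frames least similar to their predecessor."""
    scored = [(ms_ssim(frames[i - 1], frames[i]), i) for i in range(1, len(frames))]
    lowest = sorted(scored)[:num_key_frames]
    return sorted(i for _, i in lowest)
```

Under this reading, static stretches of footage score close to 1 and are dropped before the 3D CNN sees them, which is one way the complexity reduction mentioned in the abstract could arise. Frames must be at least 176 pixels on each side for five scales with an 11-pixel window.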
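CBAM (Woo et al., ECCV 2018) refines a feature map with channel attention followed by spatial attention. The abstract does not say how the module is adapted to 3D CNN features, so this NumPy forward pass is only a sketch assuming the natural extension: pooling runs over time as well as height and width, and the spatial map becomes a spatiotemporal one computed with a 3D kernel. The function names, the (C, T, H, W) layout, the bias-free MLP and the randomly drawn weights are illustrative assumptions; in a trained network these weights would be learned layers.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """x: (C, T, H, W); w1: (C // r, C); w2: (C, C // r).

    Average- and max-pool over time and space, pass both descriptors through
    the same two-layer MLP (ReLU in between), add, and squash with a sigmoid.
    """
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)

    avg = x.mean(axis=(1, 2, 3))
    mx = x.max(axis=(1, 2, 3))
    scale = sigmoid(mlp(avg) + mlp(mx))      # one weight per channel
    return x * scale[:, None, None, None]


def spatial_attention(x, kernel):
    """kernel: (2, k, k, k) with odd k, applied to the stacked channel-wise
    average and max maps with zero 'same' padding (cross-correlation)."""
    k = kernel.shape[1]
    if k % 2 == 0:
        raise ValueError("kernel size must be odd for 'same' padding")
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, T, H, W)
    p = k // 2
    padded = np.pad(desc, ((0, 0), (p, p), (p, p), (p, p)))
    t, h, w = x.shape[1:]
    logits = np.zeros((t, h, w))
    for c in range(2):
        for dt in range(k):
            for dh in range(k):
                for dw in range(k):
                    logits += kernel[c, dt, dh, dw] * padded[c, dt:dt + t, dh:dh + h, dw:dw + w]
    return x * sigmoid(logits)[None]         # one weight per (t, h, w) position


def cbam3d(x, w1, w2, kernel):
    """Channel attention first, then spatial attention, as in the original CBAM."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)
```

Both attention maps pass through a sigmoid into (0, 1), so the module only rescales activations and never changes the feature-map shape; with all-zero weights each stage simply multiplies its input by 0.5. The original 2D module uses a 7×7 spatial kernel; the kernel size here is left to the caller.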


Source journal

Circuits, Systems and Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 4.80

Self-citation rate: 13.00%

Articles published: 321

Review time: 4.6 months

Journal description: Rapid developments in the analog and digital processing of signals for communication, control, and computer systems have made the theory of electrical circuits and signal processing a burgeoning area of research and design. The aim of Circuits, Systems, and Signal Processing (CSSP) is to provide an outlet for significant research papers and state-of-the-art review articles in this area. The scope of the journal is broad, ranging from mathematical foundations to practical engineering design. It encompasses, but is not limited to, topics such as linear and nonlinear networks, distributed circuits and systems, multi-dimensional signals and systems, analog filters and signal processing, digital filters and signal processing, statistical signal processing, multimedia, computer-aided design, graph theory, neural systems, communication circuits and systems, and VLSI signal processing. The Editorial Board is international, and papers are welcome from throughout the world. The journal is devoted primarily to research papers, but survey, expository, and tutorial papers are also published. CSSP is published twelve times a year.