{"title":"Violence Detection Based on Three-Dimensional Convolutional Neural Network with Inception-ResNet","authors":"Shen Jianjie, Zou Weijun","doi":"10.1109/TOCS50858.2020.9339755","DOIUrl":null,"url":null,"abstract":"Violence detection based on deep learning is a research hotspot in intelligent video surveillance. The pre-trained Three-Dimensional convolutional network (C3D) has a weak ability to extract spatiotemporal features of video. It can only achieve an accuracy of 88.2% on the UCF-101 data set, which cannot meet the accuracy requirements for detecting violent behavior in videos. Thus, this paper proposes a network architecture based on the C3D and fusion of the Inception-Resnet-v2 network residual Inception module. Through adaptive learning of feature weights, the three-dimensional features of violent behavior videos can be fully explored and the ability to express features is enhanced. Secondly, in view of the small amount of data in the data set for violence detection (HockeyFights), which easily leads to the problems of overfitting and low generalization ability, the UCF101 data set is used for fine-tune, so that the shallow layer of the network can fully extract the spatiotemporal features; Finally, the use of quantization tools to quantify network parameters and adjusting the sliding window parameters during inference can effectively improves the inference efficiency and improves the real-time performance while ensuring high accuracy. Through experiments, the accuracy of the network designed in the paper on the UCF-101 dataset is improved by 6.1% compared to the C3D network, and by 3.1% compared with the R3D network, indicating that the improved network structure can extract more spatiotemporal features, and finally achieved an accuracy of 95.1% on the HockeyFights test set.","PeriodicalId":373862,"journal":{"name":"2020 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TOCS50858.2020.9339755","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Violence detection based on deep learning is a research hotspot in intelligent video surveillance. The pre-trained three-dimensional convolutional network (C3D) has a weak ability to extract spatiotemporal features from video: it achieves only 88.2% accuracy on the UCF-101 dataset, which does not meet the accuracy requirements for detecting violent behavior in videos. This paper therefore proposes a network architecture that fuses C3D with the residual Inception modules of Inception-ResNet-v2. Through adaptive learning of feature weights, the three-dimensional features of violent-behavior videos can be fully exploited and the network's ability to represent features is enhanced. Second, because the violence-detection dataset (HockeyFights) is small, which easily leads to overfitting and poor generalization, the UCF-101 dataset is used for fine-tuning so that the shallow layers of the network can fully extract spatiotemporal features. Finally, quantizing the network parameters with quantization tools and adjusting the sliding-window parameters during inference effectively improve inference efficiency and real-time performance while maintaining high accuracy. Experiments show that the accuracy of the proposed network on the UCF-101 dataset is 6.1% higher than that of the C3D network and 3.1% higher than that of the R3D network, indicating that the improved structure extracts more spatiotemporal features; the network finally achieves an accuracy of 95.1% on the HockeyFights test set.
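
The abstract describes fusing C3D with the residual Inception modules of Inception-ResNet-v2. Below is a minimal PyTorch sketch of what such a 3D residual Inception block could look like; it is not the authors' published code, and the branch widths, channel counts, and residual scaling factor are all illustrative assumptions. The key structural idea, parallel Conv3d branches whose concatenated output is linearly projected and added back to the input, follows the Inception-ResNet-v2 design lifted into three dimensions.

```python
# Hypothetical sketch of a 3D residual Inception block (not the paper's code).
import torch
import torch.nn as nn

class ResidualInception3D(nn.Module):
    """A 3D analogue of an Inception-ResNet block; widths are assumptions."""

    def __init__(self, channels: int, scale: float = 0.2):
        super().__init__()
        self.scale = scale
        # Branch 1: pointwise 1x1x1 convolution
        self.branch1 = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Branch 2: 1x1x1 reduction followed by a 3x3x3 convolution
        self.branch2 = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Branch 3: two stacked 3x3x3 convolutions for a wider receptive field
        self.branch3 = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 48, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(48, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Project the concatenated branches back to `channels` for the residual add
        self.project = nn.Conv3d(32 + 32 + 64, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        mixed = torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x)], dim=1
        )
        # Scaled residual connection, as in Inception-ResNet-v2
        return self.relu(x + self.scale * self.project(mixed))

if __name__ == "__main__":
    # Example on a C3D-sized intermediate feature map (8 frames, 14x14 spatial)
    block = ResidualInception3D(channels=256)
    out = block(torch.randn(2, 256, 8, 14, 14))
    print(out.shape)  # torch.Size([2, 256, 8, 14, 14])
```

The residual path lets the block learn feature weights adaptively: branches that contribute little are driven toward zero through the projection, while the identity shortcut preserves the C3D features.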
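The abstract also mentions adjusting sliding-window parameters during inference to improve real-time performance. A hedged sketch of that idea follows: the video's frame stream is split into fixed-length clips with a configurable stride, each clip is scored, and violence is flagged when any clip exceeds a threshold. The clip length of 16 (C3D's standard input length), the stride, and the threshold are assumptions, not values reported in the paper; a larger stride scores fewer clips and so runs faster at some cost in recall.

```python
# Hypothetical sliding-window inference over a long video (parameters assumed).
import torch

@torch.no_grad()
def detect_violence(model, frames: torch.Tensor,
                    clip_len: int = 16, stride: int = 8,
                    threshold: float = 0.5) -> bool:
    """frames: (channels, num_frames, height, width) tensor of a decoded video."""
    num_frames = frames.shape[1]
    for start in range(0, num_frames - clip_len + 1, stride):
        # Slice out one clip and add a batch dimension
        clip = frames[:, start:start + clip_len].unsqueeze(0)
        prob = torch.sigmoid(model(clip)).item()  # model outputs a single logit
        if prob >= threshold:
            return True  # a violent clip was found; stop early
    return False
```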