{"title":"Violence Detection Based on Three-Dimensional Convolutional Neural Network with Inception-ResNet","authors":"Shen Jianjie, Zou Weijun","doi":"10.1109/TOCS50858.2020.9339755","DOIUrl":null,"url":null,"abstract":"Violence detection based on deep learning is a research hotspot in intelligent video surveillance. The pre-trained Three-Dimensional convolutional network (C3D) has a weak ability to extract spatiotemporal features of video. It can only achieve an accuracy of 88.2% on the UCF-101 data set, which cannot meet the accuracy requirements for detecting violent behavior in videos. Thus, this paper proposes a network architecture based on the C3D and fusion of the Inception-Resnet-v2 network residual Inception module. Through adaptive learning of feature weights, the three-dimensional features of violent behavior videos can be fully explored and the ability to express features is enhanced. Secondly, in view of the small amount of data in the data set for violence detection (HockeyFights), which easily leads to the problems of overfitting and low generalization ability, the UCF101 data set is used for fine-tune, so that the shallow layer of the network can fully extract the spatiotemporal features; Finally, the use of quantization tools to quantify network parameters and adjusting the sliding window parameters during inference can effectively improves the inference efficiency and improves the real-time performance while ensuring high accuracy. Through experiments, the accuracy of the network designed in the paper on the UCF-101 dataset is improved by 6.1% compared to the C3D network, and by 3.1% compared with the R3D network, indicating that the improved network structure can extract more spatiotemporal features, and finally achieved an accuracy of 95.1% on the HockeyFights test set.","PeriodicalId":373862,"journal":{"name":"2020 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TOCS50858.2020.9339755","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Violence detection based on deep learning is a research hotspot in intelligent video surveillance. The pre-trained three-dimensional convolutional network (C3D) has a weak ability to extract spatiotemporal features from video: it achieves only 88.2% accuracy on the UCF-101 dataset, which does not meet the accuracy requirements for detecting violent behavior in videos. This paper therefore proposes a network architecture that fuses C3D with the residual Inception modules of Inception-ResNet-v2. Through adaptive learning of feature weights, the three-dimensional features of violent-behavior videos can be fully exploited and the network's ability to represent features is enhanced. Second, because the violence-detection dataset (HockeyFights) is small, which easily leads to overfitting and poor generalization, the UCF-101 dataset is used for fine-tuning so that the shallow layers of the network can fully extract spatiotemporal features. Finally, quantizing the network parameters with quantization tools and adjusting the sliding-window parameters during inference effectively improve inference efficiency and real-time performance while maintaining high accuracy. Experiments show that the accuracy of the proposed network on the UCF-101 dataset is 6.1% higher than that of the C3D network and 3.1% higher than that of the R3D network, indicating that the improved structure extracts more spatiotemporal features; the network finally achieves an accuracy of 95.1% on the HockeyFights test set.
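
The abstract describes fusing C3D with the residual Inception modules of Inception-ResNet-v2. Below is a minimal PyTorch sketch of what such a 3D residual Inception block could look like; it is not the authors' published code, and the branch widths, channel counts, and residual scaling factor are all illustrative assumptions. The key structural idea, parallel Conv3d branches whose concatenated output is linearly projected and added back to the input, follows the Inception-ResNet-v2 design lifted into three dimensions.

```python
# Hypothetical sketch of a 3D residual Inception block (not the paper's code).
import torch
import torch.nn as nn

class ResidualInception3D(nn.Module):
    """A 3D analogue of an Inception-ResNet block; widths are assumptions."""

    def __init__(self, channels: int, scale: float = 0.2):
        super().__init__()
        self.scale = scale
        # Branch 1: pointwise 1x1x1 convolution
        self.branch1 = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Branch 2: 1x1x1 reduction followed by a 3x3x3 convolution
        self.branch2 = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Branch 3: two stacked 3x3x3 convolutions for a wider receptive field
        self.branch3 = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 48, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(48, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Project the concatenated branches back to `channels` for the residual add
        self.project = nn.Conv3d(32 + 32 + 64, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        mixed = torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x)], dim=1
        )
        # Scaled residual connection, as in Inception-ResNet-v2
        return self.relu(x + self.scale * self.project(mixed))

if __name__ == "__main__":
    # Example on a C3D-sized intermediate feature map (8 frames, 14x14 spatial)
    block = ResidualInception3D(channels=256)
    out = block(torch.randn(2, 256, 8, 14, 14))
    print(out.shape)  # torch.Size([2, 256, 8, 14, 14])
```

The residual path lets the block learn feature weights adaptively: branches that contribute little are driven toward zero through the projection, while the identity shortcut preserves the C3D features.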
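The abstract also mentions adjusting sliding-window parameters during inference to improve real-time performance. A hedged sketch of that idea follows: the video's frame stream is split into fixed-length clips with a configurable stride, each clip is scored, and violence is flagged when any clip exceeds a threshold. The clip length of 16 (C3D's standard input length), the stride, and the threshold are assumptions, not values reported in the paper; a larger stride scores fewer clips and so runs faster at some cost in recall.

```python
# Hypothetical sliding-window inference over a long video (parameters assumed).
import torch

@torch.no_grad()
def detect_violence(model, frames: torch.Tensor,
                    clip_len: int = 16, stride: int = 8,
                    threshold: float = 0.5) -> bool:
    """frames: (channels, num_frames, height, width) tensor of a decoded video."""
    num_frames = frames.shape[1]
    for start in range(0, num_frames - clip_len + 1, stride):
        # Slice out one clip and add a batch dimension
        clip = frames[:, start:start + clip_len].unsqueeze(0)
        prob = torch.sigmoid(model(clip)).item()  # model outputs a single logit
        if prob >= threshold:
            return True  # a violent clip was found; stop early
    return False
```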