{"title":"Violent Scene Detection of Film Videos Based on Multi-Task Learning of Temporal-Spatial Features","authors":"Z. Zheng, Wei Zhong, Long Ye, Li Fang, Qin Zhang","doi":"10.1109/MIPR51284.2021.00067","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a new framework for the violent scene detection of film videos based on multi-task learning of temporal-spatial features. In the proposed framework, for the violent behavior representation of film clips, we employ a temporal excitation and aggregation network to extract the temporal-spatial deep features in the visual modality. And on the other hand, a recurrent neural network with local attention is utilized to extract the utterance-level representation of affective analysis in the audio modality. In the process of feature mapping, we concern the task of violent scene detection together with that of affective analysis and then propose a multi-task learning strategy to effectively predict the violent scene of film clips. To evaluate the effectiveness of the proposed method, the experiments are done on the task of violent scenes detection 2015. The experimental results show that our model outperforms most of the state of the art methods, validating the innovation of considering the task of violent scene detection jointly with the violence emotion analysis.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MIPR51284.2021.00067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
In this paper, we propose a new framework for violent scene detection in film videos based on multi-task learning of temporal-spatial features. In the proposed framework, for the representation of violent behavior in film clips, we employ a temporal excitation and aggregation network to extract temporal-spatial deep features in the visual modality. On the other hand, a recurrent neural network with local attention is utilized to extract the utterance-level representation for affective analysis in the audio modality. In the process of feature mapping, we consider the task of violent scene detection jointly with that of affective analysis and propose a multi-task learning strategy to effectively predict violent scenes in film clips. To evaluate the effectiveness of the proposed method, experiments are conducted on the Violent Scenes Detection 2015 task. The experimental results show that our model outperforms most state-of-the-art methods, validating the innovation of considering the task of violent scene detection jointly with violence emotion analysis.
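To make the multi-task idea concrete, below is a minimal PyTorch sketch of one plausible realization: a shared fusion layer over the visual and audio features feeding two heads, one for violent scene detection and one for the auxiliary affective-analysis task, trained with a weighted sum of the two losses. This is an illustrative reconstruction, not the authors' code; the feature dimensions, the number of affect classes, and the trade-off weight `alpha` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskViolenceModel(nn.Module):
    """Sketch of a multi-task head: shared fusion of visual and audio
    features, with separate heads for violence detection and affect."""

    def __init__(self, visual_dim=2048, audio_dim=512,
                 hidden_dim=256, num_affect_classes=3):
        super().__init__()
        # Shared mapping over concatenated temporal-spatial visual features
        # and utterance-level audio features (dimensions are assumptions).
        self.fusion = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
        )
        # Task 1: binary violent scene detection.
        self.violence_head = nn.Linear(hidden_dim, 1)
        # Task 2: auxiliary affective analysis (class count is assumed).
        self.affect_head = nn.Linear(hidden_dim, num_affect_classes)

    def forward(self, visual_feat, audio_feat):
        shared = self.fusion(torch.cat([visual_feat, audio_feat], dim=-1))
        return self.violence_head(shared).squeeze(-1), self.affect_head(shared)


def multi_task_loss(violence_logit, affect_logits,
                    violence_label, affect_label, alpha=0.5):
    # Weighted sum of the two task losses; alpha is a hypothetical
    # trade-off weight balancing the auxiliary affect task.
    loss_violence = F.binary_cross_entropy_with_logits(
        violence_logit, violence_label)
    loss_affect = F.cross_entropy(affect_logits, affect_label)
    return loss_violence + alpha * loss_affect


# Usage example with random stand-in features for a batch of 8 clips.
model = MultiTaskViolenceModel()
v_logit, a_logits = model(torch.randn(8, 2048), torch.randn(8, 512))
loss = multi_task_loss(v_logit, a_logits,
                       torch.randint(0, 2, (8,)).float(),
                       torch.randint(0, 3, (8,)))
loss.backward()
```

Because both heads backpropagate through the shared fusion layer, the affect task acts as a regularizer that shapes the joint representation, which is the intuition behind pairing violence detection with violence emotion analysis.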