Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues

Proceedings of the 21st ACM International Conference on Multimedia. Pub Date: 2013-10-21. DOI: 10.1145/2502081.2502187

Esra Acar, F. Hopfgartner, S. Albayrak


Citations: 24

Abstract

Detecting violent scenes in movies is an important video content understanding functionality, e.g., for providing automated youth protection services. One key issue in designing algorithms for violence detection is the choice of discriminative features. In this paper, we employ mid-level audio features and compare their discriminative power against low-level audio and visual features. We fuse these mid-level audio cues with low-level visual ones at the decision level in order to further improve the performance of violence detection. We use Mel-Frequency Cepstral Coefficients (MFCC) as audio features and average motion as visual features. To learn a violence model, we choose two-class support vector machines (SVMs). Our experimental results on detecting violent video shots in Hollywood movies show that mid-level audio features are more discriminative and provide more precise results than low-level ones. The detection performance is further enhanced by fusing the mid-level audio cues with low-level visual ones using an SVM-based decision fusion.
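The decision-level fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature values are synthetic stand-ins for the audio and average-motion features, the linear SVM trained by stochastic subgradient descent is an assumption (the abstract does not name a kernel or solver), and all function names (`train_linear_svm`, `make_shots`, `demo`) are hypothetical. The sketch only shows the structure: one two-class SVM per modality, then a second-stage SVM trained on their decision scores.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=50, seed=0):
    """Linear two-class SVM (labels +1/-1) via hinge-loss subgradient descent.

    Assumption: a simple stand-in for whatever SVM solver the paper used.
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * (dot(w, X[i]) + b) < 1.0
            # L2 regularisation shrinks the weights on every step.
            w = [wj * (1.0 - eta * lam) for wj in w]
            if violated:
                # Hinge-loss subgradient step for margin violators only.
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def score(model, x):
    w, b = model
    return dot(w, x) + b


def predict(model, x):
    return 1 if score(model, x) >= 0.0 else -1


def make_shots(n, rng):
    """Synthetic shots: (audio features, visual features, label).

    Hypothetical data: +1 = violent, -1 = non-violent.
    """
    shots = []
    for _ in range(n):
        label = rng.choice([1, -1])
        audio = [rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)]
        visual = [rng.gauss(1.0 * label, 1.0)]  # stand-in for average motion
        shots.append((audio, visual, label))
    return shots


def accuracy(preds, labels):
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def demo(seed=42):
    rng = random.Random(seed)
    train = make_shots(300, rng)
    fuse = make_shots(300, rng)
    test = make_shots(300, rng)

    train_labels = [l for _, _, l in train]
    audio_svm = train_linear_svm([a for a, _, _ in train], train_labels)
    visual_svm = train_linear_svm([v for _, v, _ in train], train_labels)

    def scores(shot_list):
        # Each shot becomes a 2-D vector of per-modality decision scores.
        return [[score(audio_svm, a), score(visual_svm, v)]
                for a, v, _ in shot_list]

    # Second-stage SVM: learns how to weigh the two modality scores.
    fusion_svm = train_linear_svm(scores(fuse), [l for _, _, l in fuse])

    labels = [l for _, _, l in test]
    return {
        "audio": accuracy([predict(audio_svm, a) for a, _, _ in test], labels),
        "visual": accuracy([predict(visual_svm, v) for _, v, _ in test], labels),
        "fused": accuracy([predict(fusion_svm, s) for s in scores(test)], labels),
    }


if __name__ == "__main__":
    print(demo())
```

The fusion SVM is trained on a separate split from the per-modality SVMs so that it sees scores the base classifiers were not fitted on; whether the paper used such a split is not stated in the abstract. The accuracies on this toy data say nothing about the results reported in the paper.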
