Qiang Chen, Yang Cai, L. Brown, A. Datta, Quanfu Fan, R. Feris, Shuicheng Yan, Alexander Hauptmann, Sharath Pankanti
{"title":"监测事件检测的时空fisher矢量编码","authors":"Qiang Chen, Yang Cai, L. Brown, A. Datta, Quanfu Fan, R. Feris, Shuicheng Yan, Alexander Hauptmann, Sharath Pankanti","doi":"10.1145/2502081.2502155","DOIUrl":null,"url":null,"abstract":"We present a generic event detection system evaluated in the Surveillance Event Detection (SED) task of TRECVID 2012. We investigate a statistical approach with spatio-temporal features applied to seven event classes, which were defined by the SED task. This approach is based on local spatio-temporal descriptors, called MoSIFT and generated by pair-wise video frames. A Gaussian Mixture Model(GMM) is learned to model the distribution of the low level features. Then for each sliding window, the Fisher vector encoding [improvedFV] is used to generate the sample representation. The model is learnt using a Linear SVM for each event. The main novelty of our system is the introduction of Fisher vector encoding into video event detection. Fisher vector encoding has demonstrated great success in image classification. The key idea is to model the low level visual features as a Gaussian Mixture Model and to generate an intermediate vector representation for bag of features. FV encoding uses higher order statistics in place of histograms in the standard BoW. FV has several good properties: (a) it can naturally separate the video specific information from the noisy local features and (b) we can use a linear model for this representation. We build an efficient implementation for FV encoding which can attain a 10 times speed-up over real-time. We also take advantage of non-trivial object localization techniques to feed into the video event detection, e.g. multi-scale detection and non-maximum suppression. This approach outperformed the results of all other teams submissions in TRECVID SED 2012 on four of the seven event types.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"Spatio-temporal fisher vector coding for surveillance event detection\",\"authors\":\"Qiang Chen, Yang Cai, L. Brown, A. Datta, Quanfu Fan, R. Feris, Shuicheng Yan, Alexander Hauptmann, Sharath Pankanti\",\"doi\":\"10.1145/2502081.2502155\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a generic event detection system evaluated in the Surveillance Event Detection (SED) task of TRECVID 2012. We investigate a statistical approach with spatio-temporal features applied to seven event classes, which were defined by the SED task. This approach is based on local spatio-temporal descriptors, called MoSIFT and generated by pair-wise video frames. A Gaussian Mixture Model(GMM) is learned to model the distribution of the low level features. Then for each sliding window, the Fisher vector encoding [improvedFV] is used to generate the sample representation. The model is learnt using a Linear SVM for each event. The main novelty of our system is the introduction of Fisher vector encoding into video event detection. Fisher vector encoding has demonstrated great success in image classification. The key idea is to model the low level visual features as a Gaussian Mixture Model and to generate an intermediate vector representation for bag of features. FV encoding uses higher order statistics in place of histograms in the standard BoW. 
FV has several good properties: (a) it can naturally separate the video specific information from the noisy local features and (b) we can use a linear model for this representation. We build an efficient implementation for FV encoding which can attain a 10 times speed-up over real-time. We also take advantage of non-trivial object localization techniques to feed into the video event detection, e.g. multi-scale detection and non-maximum suppression. This approach outperformed the results of all other teams submissions in TRECVID SED 2012 on four of the seven event types.\",\"PeriodicalId\":20448,\"journal\":{\"name\":\"Proceedings of the 21st ACM international conference on Multimedia\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 21st ACM international conference on Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2502081.2502155\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21st ACM international conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2502081.2502155","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 16
Spatio-temporal Fisher vector coding for surveillance event detection
We present a generic event detection system evaluated in the Surveillance Event Detection (SED) task of TRECVID 2012. We investigate a statistical approach with spatio-temporal features applied to the seven event classes defined by the SED task. The approach is based on local spatio-temporal descriptors, called MoSIFT, computed from pairs of video frames. A Gaussian mixture model (GMM) is learned to model the distribution of the low-level features. For each sliding window, Fisher vector encoding [improvedFV] is then used to generate the sample representation, and a linear SVM is trained for each event. The main novelty of our system is the introduction of Fisher vector encoding into video event detection. Fisher vector encoding has demonstrated great success in image classification. The key idea is to model the low-level visual features with a Gaussian mixture model and to generate an intermediate vector representation for a bag of features. FV encoding uses higher-order statistics in place of the histograms of the standard BoW. FV has two appealing properties: (a) it naturally separates the video-specific information from the noisy local features, and (b) it admits a linear model on top of the representation. We build an efficient implementation of FV encoding that attains a ten-fold speed-up over real time. We also feed non-trivial object localization techniques into the video event detection, e.g. multi-scale detection and non-maximum suppression. This approach outperformed the submissions of all other teams in TRECVID SED 2012 on four of the seven event types.
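
To make the encoding step concrete, below is a minimal sketch of improved Fisher vector encoding for a bag of spatio-temporal descriptors (e.g. MoSIFT features pooled over one sliding window), assuming a diagonal-covariance GMM fitted with scikit-learn. This is an illustration of the general improved-FV recipe (gradients with respect to the GMM means and variances, followed by signed square-root and L2 normalisation), not the authors' implementation; all function names and parameter values are hypothetical.

```python
# Sketch of improved Fisher vector (FV) encoding; not the paper's code.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(descriptors, n_components=256, seed=0):
    """Fit a diagonal-covariance GMM to low-level descriptors (e.g. MoSIFT)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=seed)
    gmm.fit(descriptors)
    return gmm

def fisher_vector(descriptors, gmm):
    """Encode the bag of descriptors from one sliding window as an FV."""
    X = np.atleast_2d(descriptors)                 # (N, D)
    N, D = X.shape
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_  # (K,), (K,D), (K,D)
    gamma = gmm.predict_proba(X)                   # soft assignments, (N, K)

    # Zeroth-, first- and second-order sufficient statistics per component.
    S0 = gamma.sum(axis=0)                         # (K,)
    S1 = gamma.T @ X                               # (K, D)
    S2 = gamma.T @ (X ** 2)                        # (K, D)

    sigma = np.sqrt(var)
    # Gradients w.r.t. means and variances: the higher-order statistics
    # that replace the histogram counts of a standard bag of words.
    G_mu = (S1 - mu * S0[:, None]) / (sigma * np.sqrt(w)[:, None]) / N
    G_var = (S2 - 2 * mu * S1 + (mu ** 2 - var) * S0[:, None]) \
            / (var * np.sqrt(2 * w)[:, None]) / N

    fv = np.hstack([G_mu.ravel(), G_var.ravel()])  # length 2 * K * D
    # Improved-FV normalisation: signed square root, then L2.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    fv /= (np.linalg.norm(fv) + 1e-12)
    return fv
```

Because the resulting representation is high-dimensional and approximately decorrelated, a linear classifier is usually sufficient, which is why the system can rely on a linear SVM per event rather than a kernel machine.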
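
The detection side can be illustrated in the same spirit: one linear SVM scores the Fisher vector of every sliding window, and overlapping high-scoring windows are pruned with temporal non-maximum suppression. The snippet below is a simplified sketch under those assumptions; it does not reproduce the paper's multi-scale localization, and the helper names are made up for illustration.

```python
# Hypothetical sliding-window scoring with per-event linear SVMs and
# greedy 1-D temporal non-maximum suppression; a sketch, not the paper's code.
import numpy as np
from sklearn.svm import LinearSVC

def train_event_svm(fisher_vectors, labels, C=1.0):
    """Train a linear SVM for one event class on window-level FVs."""
    clf = LinearSVC(C=C)
    clf.fit(fisher_vectors, labels)
    return clf

def score_windows(clf, window_fvs):
    """Signed SVM decision values, one per sliding window."""
    return clf.decision_function(np.asarray(window_fvs))

def temporal_nms(windows, scores, overlap_thresh=0.5):
    """Keep the highest-scoring windows, suppressing strongly overlapping ones.

    `windows` is a list of (start, end) times in seconds or frames.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        s_i, e_i = windows[i]
        suppressed = False
        for j in keep:
            s_j, e_j = windows[j]
            inter = max(0.0, min(e_i, e_j) - max(s_i, s_j))
            union = (e_i - s_i) + (e_j - s_j) - inter
            if union > 0 and inter / union > overlap_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return [windows[i] for i in keep], [float(scores[i]) for i in keep]
```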