Video Anomaly Detection via self-supervised and spatio-temporal proxy tasks learning

Pattern Recognition (IF 7.5; Region 1, Computer Science; JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE) Pub Date: 2024-09-14 DOI: 10.1016/j.patcog.2024.111021

Article URL: https://www.sciencedirect.com/science/article/pii/S0031320324007726

Citations: 0

Abstract

Video Anomaly Detection (VAD) aims to identify events in videos that deviate from typical patterns. Given the scarcity of anomalous samples, previous research has primarily focused on learning regular patterns from datasets containing only normal behaviors and treating deviations from these patterns as anomalies. However, most of these methods are constrained by coarse-grained modeling approaches that render them incapable of learning the highly discriminative features needed to capture the subtle differences between normal and abnormal behaviors. To better capture these features, we propose a method built on self-supervised spatio-temporal proxy tasks. First, pseudo-anomalous samples for appearance and motion are generated through geometric transformations (2D rotations) and the scrambling of video sequences. Second, a dual-branch network featuring spatio-temporal decoupling is proposed, in which the spatial and temporal branches each handle a specific proxy task. These tasks are designed to distinguish between normal and pseudo-anomalous samples: the spatial branch predicts patch-based 2D rotation angles, and the temporal branch classifies video frame triplets as total-anomaly, left-anomaly, right-anomaly, or non-anomaly. Our approach is trained end-to-end, without relying on pre-trained models (except for the object detector). Evaluations on the UCSD Ped2, Avenue, and ShanghaiTech datasets show that our method achieves AUC scores of 99.1%, 91.9%, and 81.1%, respectively, demonstrating its effectiveness. The code is publicly accessible at the following link: https://spatio-temporal-tasks.
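The two pseudo-anomaly generators described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the angle set, the patch size, or the exact definitions of the triplet classes, so the following are assumptions made only for illustration: rotations are multiples of 90 degrees, a "left-anomaly" or "right-anomaly" triplet has its left or right frame replaced by a temporally distant frame, a "total-anomaly" has both replaced, and the names `rotate_patch`, `sample_triplet`, and `min_gap` are hypothetical.

```python
import numpy as np

# Assumed angle set; the abstract only says "2D rotations".
ROTATION_CLASSES = (0, 90, 180, 270)
# Class names come from the abstract; their exact definitions below are assumptions.
TRIPLET_CLASSES = ("non-anomaly", "left-anomaly", "right-anomaly", "total-anomaly")


def rotate_patch(patch, rng):
    """Rotate a square patch by a random multiple of 90 degrees.

    Returns the rotated patch and the rotation class index (0..3),
    which a spatial branch could be trained to predict.
    """
    label = int(rng.integers(len(ROTATION_CLASSES)))
    return np.rot90(patch, k=label, axes=(0, 1)).copy(), label


def sample_triplet(clip, rng, min_gap=3):
    """Sample a (left, middle, right) frame triplet and a 4-way label.

    clip has shape (T, ...). A normal triplet uses consecutive frames.
    A pseudo-anomalous side is replaced by a frame at least `min_gap`
    steps away from the middle frame (hypothetical scrambling rule).
    """
    num_frames = clip.shape[0]
    if num_frames < 2 * min_gap + 1:
        raise ValueError("clip is too short for the requested min_gap")

    label = int(rng.integers(len(TRIPLET_CLASSES)))
    mid = int(rng.integers(1, num_frames - 1))  # keep both neighbours in range
    left, right = mid - 1, mid + 1

    def distant_index():
        candidates = [t for t in range(num_frames) if abs(t - mid) >= min_gap]
        return int(rng.choice(candidates))

    if label in (1, 3):  # left-anomaly or total-anomaly
        left = distant_index()
    if label in (2, 3):  # right-anomaly or total-anomaly
        right = distant_index()
    return clip[[left, mid, right]], label


rng = np.random.default_rng(0)
patch, rot_label = rotate_patch(np.arange(16).reshape(4, 4), rng)
triplet, trip_label = sample_triplet(np.zeros((16, 8, 8)), rng)
print(ROTATION_CLASSES[rot_label], TRIPLET_CLASSES[trip_label], triplet.shape)
```

Under these assumptions both proxy tasks reduce to 4-way classification on labels produced for free by the augmentation itself, which is what makes them self-supervised. The paper's code should be consulted for the actual sampling rules.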




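Source journal

Pattern Recognition (Engineering and Technology: Engineering, Electrical and Electronic)

CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months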

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.

Latest articles from the journal

A novel domain independent scene text localizer
Video Anomaly Detection via self-supervised and spatio-temporal proxy tasks learning
FICE: Text-conditioned fashion-image editing with guided GAN inversion
Collaborative graph neural networks for augmented graphs: A local-to-global perspective
Asymmetric patch sampling for contrastive learning