CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos

Yang Liu; Hongjin Wang; Zepu Wang; Xiaoguang Zhu; Jing Liu; Peng Sun; Rui Tang; Jianwei Du; Victor C. M. Leung; Liang Song

IEEE Transactions on Image Processing, vol. 34, pp. 2351-2366. Published 2025-04-11.
DOI: 10.1109/TIP.2025.3558089
URL: https://ieeexplore.ieee.org/document/10962292/
Abstract
Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection. Due to the rarity and diversity of anomalies, existing methods use only easily collected regular events to model the inherent normality of spatio-temporal patterns in an unsupervised manner. Although such methods have made significant progress by leveraging advances in deep learning, they model the statistical dependency between observable videos and semantic labels, which is a crude description of normality and lacks a systematic exploration of the underlying causal relationships. Previous studies have shown that existing unsupervised VAD models cannot handle label-independent data shifts (e.g., scene changes) in real-world scenarios and may fail to detect subtle anomalies owing to the over-generalization of deep neural networks. Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and that present significant deviations when anomalous instances occur. In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine scene-robust causal variables in unsupervised video normality learning. Specifically, building on structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias from deep representations and to learn causal video normality, respectively. Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. Moreover, ablation studies and extension validation show that CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.
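The abstract does not disclose CRCL's concrete architectures or loss terms, so the following PyTorch sketch is purely illustrative of the representation-consistency idea it describes: a representation that captures causal normality rather than scene appearance should remain stable when the same regular event is viewed under a scene perturbation. All names here (VideoEncoder, consistency_loss, the photometric jitter used as a crude stand-in for a scene change) are hypothetical and not taken from the paper.

```python
# Minimal sketch, NOT the authors' implementation: a generic
# representation-consistency objective on normal clips, loosely
# mirroring the scene-robustness goal described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Toy clip encoder: flattens a (B, T, C, H, W) clip into an embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.net(clip.flatten(start_dim=1))

def consistency_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Pull two views of the same regular event together in embedding space."""
    return 1.0 - F.cosine_similarity(z_a, z_b, dim=-1).mean()

# One hypothetical training step on normal data only: the same clip is
# shown under a photometric jitter (a crude proxy for a scene change);
# minimizing the consistency loss encourages the embedding to depend on
# the event itself rather than on scene-specific appearance.
B, T, C, H, W = 4, 8, 3, 32, 32
clip = torch.rand(B, T, C, H, W)
perturbed = (clip + 0.1 * torch.randn_like(clip)).clamp(0.0, 1.0)

encoder = VideoEncoder(in_dim=T * C * H * W)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

opt.zero_grad()
loss = consistency_loss(encoder(clip), encoder(perturbed))
loss.backward()
opt.step()
print(f"consistency loss: {loss.item():.4f}")
```

At test time, under this toy setup, an anomalous clip would be flagged because its embedding deviates from the learned normality prototypes; the paper's actual scoring mechanism is not specified in the abstract.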