Title: EOGT: Video Anomaly Detection with Enhanced Object Information and Global Temporal Dependency
Authors: Ruoyan Pi, Peng Wu, Xiangteng He, Yuxin Peng
DOI: 10.1145/3662185
Journal: ACM Transactions on Multimedia Computing Communications and Applications (Q1, Computer Science, Information Systems; Impact Factor 5.2)
Publication date: 2024-05-06
Publication type: Journal Article
Citations: 0
Abstract
Video anomaly detection (VAD) aims to identify events or scenes in videos that deviate from typical patterns. Existing approaches primarily focus on reconstructing or predicting frames to detect anomalies and have shown improved performance in recent years. However, they often depend heavily on local spatio-temporal information and suffer from insufficient object feature modeling. To address these issues, this paper proposes a video anomaly detection framework with Enhanced Object Information and Global Temporal Dependencies (EOGT). Its main novelties are: (1) A Local Object Anomaly Stream (LOAS) that extracts local multimodal spatio-temporal anomaly features at the object level. LOAS integrates two modules: a Diffusion-based Object Reconstruction Network (DORN) with multimodal conditions, which detects anomalies from object RGB information, and an Object Pose Anomaly Refiner (OPA), which discovers anomalies from human pose information. (2) A Global Temporal Strengthening Stream (GTSS) that leverages video-level temporal dependencies to effectively identify long-term and video-specific anomalies. Both streams are jointly employed in EOGT to learn multimodal and multi-scale spatio-temporal anomaly features for VAD; the anomaly features and scores from the two streams are finally fused to detect anomalies at the frame level. Extensive experiments on three public datasets, ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2, verify the performance of EOGT.
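The abstract describes fusing per-frame anomaly scores from the two streams (LOAS and GTSS) to produce frame-level detections. The paper's exact fusion rule is not given here, so the following is only a minimal sketch under common assumptions: each stream's scores are min-max normalized over the video, then combined by a weighted sum (the weight `alpha` and the normalization are assumptions, not the authors' published method).

```python
import numpy as np

def fuse_anomaly_scores(loas_scores, gtss_scores, alpha=0.5):
    """Fuse per-frame anomaly scores from two hypothetical streams.

    Each stream is min-max normalized over the video, then combined by a
    weighted sum. `alpha` weights the local object stream (LOAS); this
    fusion rule is an illustrative assumption, not EOGT's exact method.
    """
    def minmax(s):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    return alpha * minmax(loas_scores) + (1 - alpha) * minmax(gtss_scores)

# A frame is flagged anomalous when its fused score exceeds a threshold.
scores = fuse_anomaly_scores([0.1, 0.2, 0.9], [0.0, 0.3, 0.8], alpha=0.5)
flags = scores > 0.5
```

In practice such thresholds are swept to produce the frame-level AUC typically reported on ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2.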
Journal description:
The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome.
TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles in each issue. In addition, all special issues are published online-only to ensure timely publication. The transactions consist primarily of research papers. As an archival journal, it is intended that the papers have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.