3DVQA: Visual Question Answering for 3D Environments
Yasaman Etesam, Leon Kochiev, Angel X. Chang
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00038
Visual Question Answering (VQA) is a widely studied problem in computer vision and natural language processing. However, current approaches to VQA have been investigated primarily in the 2D image domain. We study VQA in the 3D domain, with our input being point clouds of real-world 3D scenes instead of 2D images. We believe that this 3D data modality provides richer spatial relation information that is of interest for the VQA task. In this paper, we introduce the 3DVQA-ScanNet dataset, the first VQA dataset in 3D, and we investigate the performance of a spectrum of baseline approaches on the 3D VQA task.
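As illustration only (the abstract does not describe a specific architecture), here is a minimal sketch of what a point-cloud VQA baseline along these lines could look like, assuming a PointNet-style scene encoder, an LSTM question encoder, and fusion by concatenation; all module names and dimensions are hypothetical rather than the authors' design.

```python
import torch
import torch.nn as nn

class PointCloudVQABaseline(nn.Module):
    """Hypothetical 3D VQA baseline: PointNet-style scene encoder + LSTM
    question encoder, fused by concatenation into an answer classifier."""
    def __init__(self, vocab_size, num_answers, point_dim=6, hidden=256):
        super().__init__()
        # Per-point MLP followed by max-pooling gives a global scene feature.
        self.point_mlp = nn.Sequential(
            nn.Linear(point_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, hidden),
        )
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, points, question_tokens):
        # points: (B, N, point_dim) xyz+rgb; question_tokens: (B, L) word ids
        scene_feat = self.point_mlp(points).max(dim=1).values   # (B, hidden)
        _, (h, _) = self.lstm(self.embed(question_tokens))      # h: (1, B, hidden)
        fused = torch.cat([scene_feat, h.squeeze(0)], dim=-1)
        return self.classifier(fused)                           # answer logits

logits = PointCloudVQABaseline(vocab_size=5000, num_answers=100)(
    torch.rand(2, 4096, 6), torch.randint(0, 5000, (2, 12)))
```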
{"title":"3DVQA: Visual Question Answering for 3D Environments","authors":"Yasaman Etesam, Leon Kochiev, Angel X. Chang","doi":"10.1109/CRV55824.2022.00038","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00038","url":null,"abstract":"Visual Question Answering (VQA) is a widely studied problem in computer vision and natural language processing. However, current approaches to VQA have been investigated primarily in the 2D image domain. We study VQA in the 3D domain, with our input being point clouds of real-world 3D scenes, instead of 2D images. We believe that this 3D data modality provide richer spatial relation information that is of interest in the VQA task. In this paper, we introduce the 3DVQA-ScanNet dataset, the first VQA dataset in 3D, and we investigate the performance of a spectrum of baseline approaches on the 3D VQA task.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130468311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A View Invariant Human Action Recognition System for Noisy Inputs
Joo Wang Kim, J. Hernandez, Richard Cobos, Ricardo Palacios, Andres G. Abad
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00017
We propose a skeleton-based Human Action Recognition (HAR) system that is robust to both noisy inputs and perspective variation. The system receives RGB videos as input and consists of three modules: (M1) a 2D Key-Points Estimation module, (M2) a Robustness module, and (M3) an Action Classification module, of which M2 is our main contribution. This module uses a pre-trained 3D pose estimator and pose refinement networks to handle noisy information, including missing points, and uses rotations of the 3D poses to add robustness to camera view-point variation. To evaluate our approach, we carried out comparison experiments between models trained with and without M2. These experiments were conducted on the UESTC view-varying dataset, the i3DPost multi-view human action dataset, and a Boxing Actions dataset that we created. Our system achieved positive results, improving accuracy by 24%, 3%, and 11% on the respective datasets. On the UESTC dataset, our method achieves a new state of the art for the cross-view evaluation protocols.
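A minimal sketch of the view-augmentation idea behind M2, assuming rotations of the lifted 3D skeleton about the vertical axis; the joint layout, axis convention, and angles are illustrative, not taken from the paper.

```python
import numpy as np

def rotate_pose_about_vertical(joints_3d, angle_deg):
    """Rotate a 3D skeleton (J x 3; x right / y up / z forward assumed)
    about the vertical axis -- the kind of view augmentation M2 relies on."""
    a = np.deg2rad(angle_deg)
    rot_y = np.array([[ np.cos(a), 0.0, np.sin(a)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
    return joints_3d @ rot_y.T

pose = np.random.rand(17, 3)                       # e.g. a 17-joint skeleton
augmented = [rotate_pose_about_vertical(pose, a) for a in (0, 90, 180, 270)]
```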
{"title":"A View Invariant Human Action Recognition System for Noisy Inputs","authors":"Joo Wang Kim, J. Hernandez, Richard Cobos, Ricardo Palacios, Andres G. Abad","doi":"10.1109/CRV55824.2022.00017","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00017","url":null,"abstract":"We propose a skeleton-based Human Action Recognition (HAR) system, robust to both noisy inputs and perspective variation. This system receives RGB videos as input and consists of three modules: (M1) 2D Key-Points Estimation module, (M2) Robustness module, and (M3) Action Classification module; of which M2 is our main contribution. This module uses pre-trained 3D pose estimator and pose refinement networks to handle noisy information including missing points, and uses rotations of the 3D poses to add robustness to camera view-point variation. To evaluate our approach, we carried out comparison experiments between models trained with M2 and without it. These experiments were conducted on the UESTC view-varying dataset, on the i3DPost multi-view human action dataset and on a Boxing Actions dataset, created by us. Our system achieved positive results, improving the accuracy by 24%, 3% and 11% on each dataset, respectively. On the UESTC dataset, our method achieves the new state of the art for the cross-view evaluation protocols.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114337637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learned Intrinsic Auto-Calibration From Fundamental Matrices
Karim Samaha, Georges Younes, Daniel C. Asmar, J. Zelek
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00037
Auto-calibration that relies on unconstrained image content and epipolar relationships is necessary in online operation, especially when internal calibration parameters such as focal length can vary. In contrast, traditional calibration relies on a checkerboard and other scene information and is typically conducted offline. Unfortunately, auto-calibration may not always converge when solved traditionally as an iterative optimization. We propose to solve for the intrinsic calibration parameters using a neural network trained on a synthetic Unity dataset that we created. We demonstrate our results on both synthetic and real data to validate the generalizability of our neural network model, which outperforms traditional methods by 2% to 30% and outperforms recent deep learning approaches by a factor of 2 to 4.
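A minimal sketch of the general idea of regressing intrinsics from fundamental matrices with a network, assuming a small MLP over the scale-normalized, flattened matrix that predicts focal lengths; the architecture and outputs are hypothetical, not the authors'.

```python
import torch
import torch.nn as nn

class IntrinsicsFromF(nn.Module):
    """Hypothetical regressor: flattened, scale-normalized fundamental matrix
    in, intrinsic parameters (here just focal lengths fx, fy) out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 2),          # predicts (fx, fy)
        )

    def forward(self, F_mat):
        # F_mat: (B, 3, 3); normalize since F is only defined up to scale.
        F_flat = F_mat.flatten(1)
        F_flat = F_flat / (F_flat.norm(dim=1, keepdim=True) + 1e-8)
        return self.net(F_flat)

focals = IntrinsicsFromF()(torch.rand(4, 3, 3))    # (4, 2)
```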
{"title":"Learned Intrinsic Auto-Calibration From Fundamental Matrices","authors":"Karim Samaha, Georges Younes, Daniel C. Asmar, J. Zelek","doi":"10.1109/CRV55824.2022.00037","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00037","url":null,"abstract":"Auto-calibration that relies on unconstrained image content and epipolar relationships is necessary in online operations, especially when internal calibration parameters such as focal length can vary. In contrast, traditional calibration relies on a checkerboard and other scene information and are typically conducted offline. Unfortunately, auto-calibration may not always converge when solved traditionally in an iterative optimization formalism. We propose to solve for the intrinsic calibration parameters using a neural network that is trained on a synthetic Unity dataset that we created. We demonstrate our results on both synthetic and real data to validate the generalizability of our neural network model, which outperforms traditional methods by 2% to 30%, and outperforms recent deep learning approaches by a factor of 2 to 4 times.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127822748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TemporalNet: Real-time 2D-3D Video Object Detection
Mei-Huan Chen, J. Lang
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00034
Designing a video detection network based on state-of-the-art single-image object detectors may seem like an obvious choice. However, video object detection has extra challenges due to the lower quality of individual frames in a video, and hence the need to include temporal information for high-quality detection results. We design a novel interleaved architecture combining a 2D convolutional network and a 3D temporal network. To exploit inter-frame information, we propose feature aggregation based on a temporal network. Our TemporalNet utilizes Appearance-Preserving 3D Convolution (AP3D) to extract aligned features along the temporal dimension. The temporal network operates at multiple scales for better performance, allowing communication between 2D and 3D blocks at each scale and also across scales. TemporalNet is a plug-and-play block that can be added to a multi-scale single-image detection network without any adjustments to the network architecture. When TemporalNet is applied to YOLOv3, it runs in real time at 35 ms/frame on a low-end GPU. Our real-time approach achieves 77.1% mAP (mean Average Precision) on the ImageNet VID 2017 dataset with TemporalNet-4, while TemporalNet-16 achieves a competitive 80.9% mAP.
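A minimal sketch of an interleaved 2D/3D block of the kind described, assuming a per-frame 2D convolution and a temporal-only 3D convolution fused residually; this is a generic illustration, not the AP3D-based TemporalNet block itself.

```python
import torch
import torch.nn as nn

class Interleaved2D3DBlock(nn.Module):
    """Hypothetical interleaved block: a per-frame 2D conv and a temporal 3D
    conv over the same feature map, fused residually. Input: (B, T, C, H, W)."""
    def __init__(self, channels):
        super().__init__()
        self.conv2d = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3d = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU()

    def forward(self, x):
        b, t, c, h, w = x.shape
        # 2D path: fold time into the batch, convolve each frame spatially.
        spatial = self.conv2d(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # 3D path: convolve along the temporal axis only.
        temporal = self.conv3d(x.permute(0, 2, 1, 3, 4)).permute(0, 2, 1, 3, 4)
        return self.act(x + spatial + temporal)

out = Interleaved2D3DBlock(64)(torch.rand(1, 8, 64, 32, 32))
```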
{"title":"TemporalNet: Real-time 2D-3D Video Object Detection","authors":"Mei-Huan Chen, J. Lang","doi":"10.1109/CRV55824.2022.00034","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00034","url":null,"abstract":"Designing a video detection network based on state-of-the-art single-image object detectors may seem like an obvious choice. However, video object detection has extra challenges due to the lower quality of individual frames in a video, and hence the need to include temporal information for high-quality detection results. We design a novel interleaved architecture combining a 2D convolutional network and a 3D temporal network. To explore inter-frame information, we propose feature aggregation based on a temporal network. Our TemporalNet utilizes Appearance-preserving 3D convolution (AP3D) for extracting aligned features in the temporal dimension. Our temporal network functions at multiple scales for better performance, which allows communication between 2D and 3D blocks at each scale and also across scales. Our TemporalNet is a plug-and-play block that can be added to a multi-scale single-image detection network without any adjustments in the network architecture. When TemporalNet is applied to Yolov3 it is real-time with a running time of 35ms/frame on a low-end GPU. Our real-time approach achieves 77.1 % mAP (mean Average Precision) on ImageNet VID 2017 dataset with TemporalNet-4, where TemporalNet-16 achieves 80.9 % mAP which is a competitive result.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114647008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Permutation Model for the Self-Supervised Stereo Matching Problem
Pierre-Andre Brousseau, S. Roy
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00024
This paper proposes a novel permutation formulation of the stereo matching problem. Our approach introduces a permutation volume, which provides a natural representation of stereo constraints and disentangles stereo matching from monocular disparity estimation. It also has the benefit of simultaneously computing disparity and a confidence measure, which provides explainability and a simple confidence heuristic for occlusions. In the context of self-supervised learning, stereo performance is validated on standard test datasets and the confidence maps are validated through stereo-visibility. Results show that the permutation volume increases stereo performance and exhibits good generalization behaviour. We believe that measuring confidence is a key part of explainability, which is instrumental to the adoption of deep methods in critical stereo applications such as autonomous navigation.
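A minimal sketch of how a per-pixel matching distribution can yield both a disparity and a confidence map, using a generic softmax-normalized correlation volume with a soft-argmax readout; this is a stand-in for, not a reproduction of, the paper's permutation volume.

```python
import torch
import torch.nn.functional as F

def soft_disparity_with_confidence(feat_l, feat_r, max_disp=64):
    """Correlation over candidate disparities, softmax-normalized so every
    pixel carries a distribution over matches; disparity is its expectation
    and confidence its peak."""
    B, C, H, W = feat_l.shape
    cost = feat_l.new_full((B, max_disp, H, W), float('-inf'))
    for d in range(max_disp):
        if d == 0:
            cost[:, 0] = (feat_l * feat_r).mean(dim=1)
        else:
            cost[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    prob = F.softmax(cost, dim=1)                                # (B, D, H, W)
    disp_values = torch.arange(max_disp, dtype=prob.dtype,
                               device=prob.device).view(1, -1, 1, 1)
    disparity = (prob * disp_values).sum(dim=1)                  # soft-argmax
    confidence = prob.max(dim=1).values                          # in (0, 1]
    return disparity, confidence

disp, conf = soft_disparity_with_confidence(torch.rand(1, 32, 60, 80),
                                            torch.rand(1, 32, 60, 80))
```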
{"title":"A Permutation Model for the Self-Supervised Stereo Matching Problem","authors":"Pierre-Andre Brousseau, S. Roy","doi":"10.1109/CRV55824.2022.00024","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00024","url":null,"abstract":"This paper proposes a novel permutation formulation to the stereo matching problem. Our proposed approach introduces a permutation volume which provides a natural representation of stereo constraints and disentangles stereo matching from monocular disparity estimation. It also has the benefit of simultaneously computing disparity and a confidence measure which provides explainability and a simple confidence heuristic for occlusions. In the context of self-supervised learning, the stereo performance is validated for standard testing datasets and the confidence maps are validated through stereo-visibility. Results show that the permutation volume increases stereo performance and features good generalization behaviour. We believe that measuring confidence is a key part of explainability which is instrumental to adoption of deep methods in critical stereo applications such as autonomous navigation.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125639818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrating High-Resolution Tactile Sensing into Grasp Stability Prediction
Lachlan Chumbley, Morris Gu, Rhys Newbury, J. Leitner, Akansel Cosgun
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00021
We investigate how high-resolution tactile sensors can be used in combination with vision and depth sensing to improve grasp stability prediction. Recent advances in simulating high-resolution tactile sensing, in particular the TACTO simulator, enabled us to evaluate how neural networks can be trained with a combination of sensing modalities. Given the large amounts of data needed to train large neural networks, robotic simulators provide a fast way to automate the data collection process. We expand on existing work through an ablation study and an increased set of objects taken from the YCB benchmark set. Our results indicate that while the combination of vision, depth, and tactile sensing provides the best prediction results on known objects, the network fails to generalize to unknown objects. Our work also addresses existing issues with robotic grasping in tactile simulation and how to overcome them.
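A minimal sketch of a vision-depth-tactile fusion classifier of the kind evaluated, assuming one small CNN per modality with concatenated features feeding a stable/unstable head; the encoders, input sizes, and three-channel tactile image are assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

def small_cnn(in_ch, out_dim=128):
    """Tiny per-modality encoder (RGB, depth, or tactile image)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, out_dim),
    )

class GraspStabilityNet(nn.Module):
    """Hypothetical fusion model: encode each modality separately,
    concatenate, and predict a single grasp-success logit."""
    def __init__(self):
        super().__init__()
        self.rgb, self.depth, self.tactile = small_cnn(3), small_cnn(1), small_cnn(3)
        self.head = nn.Sequential(nn.Linear(3 * 128, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, rgb, depth, tactile):
        feats = torch.cat([self.rgb(rgb), self.depth(depth), self.tactile(tactile)], dim=1)
        return self.head(feats)   # grasp-success logit

logit = GraspStabilityNet()(torch.rand(2, 3, 128, 128),
                            torch.rand(2, 1, 128, 128),
                            torch.rand(2, 3, 64, 64))
```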
{"title":"Integrating High-Resolution Tactile Sensing into Grasp Stability Prediction","authors":"Lachlan Chumbley, Morris Gu, Rhys Newbury, J. Leitner, Akansel Cosgun","doi":"10.1109/CRV55824.2022.00021","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00021","url":null,"abstract":"We investigate how high-resolution tactile sensors can be utilized in combination with vision and depth sensing, to improve grasp stability prediction. Recent advances in simulating high-resolution tactile sensing, in particular the TACTO simulator, enabled us to evaluate how neural networks can be trained with a combination of sensing modalities. With the large amounts of data needed to train large neural networks, robotic simulators provide a fast way to automate the data collection process. We expand on the existing work through an ablation study and an increased set of objects taken from the YCB benchmark set. Our results indicate that while the combination of vision, depth, and tactile sensing provides the best prediction results on known objects, the network fails to generalize to unknown objects. Our work also addresses existing issues with robotic grasping in tactile simulation and how to overcome them.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133102915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anomaly Detection with Adversarially Learned Perturbations of Latent Space
Vahid Reza Khazaie, A. Wong, John Taylor Jewell, Y. Mohsenzadeh
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00031
Anomaly detection aims to identify samples that do not conform to the distribution of normal data. Due to the unavailability of anomalous data, training a supervised deep neural network is a cumbersome task; as such, unsupervised methods are the common approach to this task. Deep autoencoders have been broadly adopted as the basis of many unsupervised anomaly detection methods. However, a notable shortcoming of deep autoencoders is that they provide insufficient representations for anomaly detection because they generalize well enough to reconstruct outliers. In this work, we design an adversarial framework consisting of two competing components: an Adversarial Distorter and an Autoencoder. The Adversarial Distorter is a convolutional encoder that learns to produce effective perturbations, and the Autoencoder is a deep convolutional neural network that aims to reconstruct the images from the perturbed latent feature space. The networks are trained with opposing goals: the Adversarial Distorter produces perturbations that are applied to the encoder's latent feature space to maximize the reconstruction error, while the Autoencoder tries to neutralize the effect of these perturbations to minimize it. When applied to anomaly detection, the proposed method learns semantically richer representations due to the perturbations applied to the feature space. The proposed method outperforms existing state-of-the-art methods for anomaly detection on image and video datasets.
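A minimal sketch of the opposing training objectives described above, assuming toy convolutional encoder/decoder/distorter modules and an MSE reconstruction loss; shapes, hyperparameters, and the alternating update scheme are illustrative, not the authors' exact training procedure.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: the encoder maps images to a latent map, the distorter
# maps that latent to a perturbation of the same shape, the decoder reconstructs.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(), nn.Conv2d(32, 64, 3, 2, 1))
decoder = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                        nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())
distorter = nn.Sequential(nn.Conv2d(64, 64, 3, 1, 1), nn.ReLU(), nn.Conv2d(64, 64, 3, 1, 1))

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_dist = torch.optim.Adam(distorter.parameters(), lr=1e-4)
mse = nn.MSELoss()

for step in range(10):                       # stand-in for a real data loader
    x = torch.rand(8, 3, 64, 64)             # batch of "normal" images

    # Distorter step: maximize reconstruction error of the perturbed latent.
    z = encoder(x).detach()
    loss_dist = -mse(decoder(z + distorter(z)), x)
    opt_dist.zero_grad(); loss_dist.backward(); opt_dist.step()

    # Autoencoder step: minimize the same error despite the perturbation.
    z = encoder(x)
    loss_ae = mse(decoder(z + distorter(z).detach()), x)
    opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()
```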
{"title":"Anomaly Detection with Adversarially Learned Perturbations of Latent Space","authors":"Vahid Reza Khazaie, A. Wong, John Taylor Jewell, Y. Mohsenzadeh","doi":"10.1109/CRV55824.2022.00031","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00031","url":null,"abstract":"Anomaly detection is to identify samples that do not conform to the distribution of the normal data. Due to the unavailability of anomalous data, training a supervised deep neural network is a cumbersome task. As such, unsupervised methods are preferred as a common approach to solve this task. Deep autoencoders have been broadly adopted as a base of many unsupervised anomaly detection methods. However, a notable shortcoming of deep autoencoders is that they provide insufficient representations for anomaly detection by generalizing to reconstruct outliers. In this work, we have designed an adversarial framework consisting of two competing components, an Adversarial Distorter, and an Autoencoder. The Adversarial Distorter is a convolutional encoder that learns to produce effective perturbations and the autoencoder is a deep convolutional neural network that aims to reconstruct the images from the perturbed latent feature space. The networks are trained with opposing goals in which the Adversarial Distorter produces perturbations that are applied to the en-coder's latent feature space to maximize the reconstruction error and the autoencoder tries to neutralize the effect of these perturbations to minimize it. When applied to anomaly detection, the proposed method learns semantically richer representations due to applying perturbations to the feature space. The proposed method outperforms the existing state-of-the-art methods in anomaly detection on image and video datasets.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115700758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inter- & Intra-City Image Geolocalization
J. Tanner, K. Dick, J.R. Green
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00023
Can a photo be accurately geolocated within a city from its pixels alone? While this image geolocation problem has been successfully addressed at the planetary and national levels when framed as a classification problem using convolutional neural networks, no method has yet been able to precisely geolocate images within a city and/or at the street level when framed as a latitude/longitude regression problem. We leverage the densely sampled Streetlearn dataset of imagery from Manhattan and Pittsburgh to first develop a highly accurate inter-city predictor and then experimentally resolve, for the first time, the intra-city performance limits of framing image geolocation as a regression problem. We then reformulate the problem as an extreme-resolution classification task by subdividing the city into hundreds of equirectangular-scaled bins and train our respective intra-city deep convolutional neural networks on tens of thousands of images. Our experiments serve as a foundation for a scalable inter- and intra-city image geolocation framework that, on average, resolves an image to within 250 m². We demonstrate that our models outperform SIFT-based image-retrieval models under differing weather patterns and lighting conditions and on location-specific imagery, and that they are temporally robust when evaluated on both past and future imagery. The practical and ethical ramifications of such a model are also discussed, given the threat to individual privacy in a technocentric surveillance-capitalist society.
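A minimal sketch of turning latitude/longitude into classification targets over a grid of roughly square cells laid on a city bounding box; the cell size, bounding box, and meters-per-degree approximation are illustrative, not the paper's binning scheme.

```python
import math

def latlon_to_bin(lat, lon, bounds, cell_m=500.0):
    """Map a coordinate to a class index on a grid of roughly square cells
    covering the city's bounding box (illustrative values only)."""
    lat_min, lat_max, lon_min, lon_max = bounds
    m_per_deg_lat = 111_320.0                               # approx. metres per degree
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians((lat_min + lat_max) / 2))
    n_cols = max(1, math.ceil((lon_max - lon_min) * m_per_deg_lon / cell_m))
    row = int((lat - lat_min) * m_per_deg_lat // cell_m)
    col = int((lon - lon_min) * m_per_deg_lon // cell_m)
    return row * n_cols + col                               # class label for the CNN

manhattan = (40.70, 40.88, -74.03, -73.91)                  # rough bounding box
print(latlon_to_bin(40.7580, -73.9855, manhattan))          # bin containing Times Square
```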
{"title":"Inter- & Intra-City Image Geolocalization","authors":"J. Tanner, K. Dick, J.R. Green","doi":"10.1109/CRV55824.2022.00023","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00023","url":null,"abstract":"Can a photo be accurately geolocated within a city from its pixels alone? While this image geolocation problem has been successfully addressed at the planetary- and nation-levels when framed as a classification problem using convolutional neural networks, no method has yet been able to precisely geolocate images within the city- and/or at the street-level when framed as a latitude/longitude regression-type problem. We leverage the highly densely sampled Streetlearn dataset of imagery from Manhattan and Pittsburgh to first develop a highly accurate inter-city predictor and then experimentally resolve, for the first time, the intra-city performance limits of framing image geolocation as a regression-type problem. We then reformulate the problem as an extreme-resolution classification task by subdividing the city into hundreds of equirectangular-scaled bins and train our respective intra-city deep convolutional neural network on tens of thousands of images. Our experiments serve as a foundation to develop a scalable inter- and intra-city image geolocation framework that, on average, resolves an image within 250 m2. We demonstrate that our models outperform SIFT-based image retrieval-type models based on differing weather patterns, lighting conditions, location-specific imagery, and are temporally robust when evaluated upon both past and future imagery. Both the practical and ethical ramifications of such a model are also discussed given the threat to individual privacy in a technocentric surveillance capitalist society.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130183709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Object Class Aware Video Anomaly Detection through Image Translation
M. Baradaran, R. Bergevin
Pub Date: 2022-05-01 | DOI: 10.1109/CRV55824.2022.00020
Semi-supervised video anomaly detection (VAD) methods formulate anomaly detection as the detection of deviations from learned normal patterns. Previous works in the field (reconstruction- or prediction-based methods) suffer from two drawbacks: 1) they focus on low-level features and (especially holistic approaches) do not effectively consider object classes; 2) object-centric approaches neglect some of the context information (such as location). To tackle these challenges, this paper proposes a novel two-stream object-aware VAD method that learns normal appearance and motion patterns through image translation tasks. The appearance branch translates the input image to the target semantic segmentation map produced by Mask-RCNN, and the motion branch associates each frame with its expected optical flow magnitude. Any deviation from the expected appearance or motion at inference time indicates the degree of potential abnormality. We evaluated the proposed method on the ShanghaiTech, UCSD-Ped1, and UCSD-Ped2 datasets, and the results show competitive performance compared with state-of-the-art works. Most importantly, the results show that, as a significant improvement over previous methods, detections by our method are completely explainable and anomalies are localized accurately in the frames.
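A minimal sketch of combining the two branches' deviations into a frame-level anomaly score, assuming L1 error against the segmentation-map and flow-magnitude targets with illustrative weights; the scoring function is an assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def frame_anomaly_score(pred_seg, target_seg, pred_flow_mag, target_flow_mag,
                        w_app=1.0, w_mot=1.0):
    """Combine per-branch deviations into a frame-level anomaly score:
    appearance error against the Mask-RCNN segmentation map and motion error
    against the optical-flow magnitude. Higher score = more anomalous."""
    app_err = F.l1_loss(pred_seg, target_seg, reduction='none').mean(dim=(1, 2, 3))
    mot_err = F.l1_loss(pred_flow_mag, target_flow_mag, reduction='none').mean(dim=(1, 2, 3))
    return w_app * app_err + w_mot * mot_err

score = frame_anomaly_score(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64),
                            torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```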
{"title":"Object Class Aware Video Anomaly Detection through Image Translation","authors":"M. Baradaran, R. Bergevin","doi":"10.1109/CRV55824.2022.00020","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00020","url":null,"abstract":"Semi-supervised video anomaly detection (VAD) methods formulate the task of anomaly detection as detection of deviations from the learned normal patterns. Previous works in the field (reconstruction or prediction-based methods) suffer from two drawbacks: 1) They focus on low-level features, and they (especially holistic approaches) do not effectively consider the object classes. 2) Object-centric approaches neglect some of the context information (such as location). To tackle these challenges, this paper proposes a novel two-stream object-aware VAD method that learns the normal appearance and motion patterns through image translation tasks. The appearance branch translates the input image to the target semantic segmentation map produced by Mask-RCNN, and the motion branch associates each frame with its expected optical flow magnitude. Any deviation from the expected appearance or motion in the inference stage shows the degree of potential abnormality. We evaluated our proposed method on the ShanghaiTech, UCSD-Pedl, and UCSD-Ped2 datasets and the results show competitive performance compared with state-of-the-art works. Most importantly, the results show that, as significant improvements to previous methods, detections by our method are completely explainable and anomalies are localized accurately in the frames.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125956425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}