
2022 19th Conference on Robots and Vision (CRV): Latest Publications

3DVQA: Visual Question Answering for 3D Environments
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00038
Yasaman Etesam, Leon Kochiev, Angel X. Chang
Visual Question Answering (VQA) is a widely studied problem in computer vision and natural language processing. However, current approaches to VQA have been investigated primarily in the 2D image domain. We study VQA in the 3D domain, with our input being point clouds of real-world 3D scenes instead of 2D images. We believe that this 3D data modality provides richer spatial relation information that is of interest in the VQA task. In this paper, we introduce the 3DVQA-ScanNet dataset, the first VQA dataset in 3D, and we investigate the performance of a spectrum of baseline approaches on the 3D VQA task.
Citations: 3
Conference Organization: CRV 2022
Pub Date : 2022-05-01 DOI: 10.1109/crv55824.2022.00006
{"title":"Conference Organization: CRV 2022","authors":"","doi":"10.1109/crv55824.2022.00006","DOIUrl":"https://doi.org/10.1109/crv55824.2022.00006","url":null,"abstract":"","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133842507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A View Invariant Human Action Recognition System for Noisy Inputs
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00017
Joo Wang Kim, J. Hernandez, Richard Cobos, Ricardo Palacios, Andres G. Abad
We propose a skeleton-based Human Action Recognition (HAR) system that is robust to both noisy inputs and perspective variation. This system receives RGB videos as input and consists of three modules: (M1) a 2D Key-Points Estimation module, (M2) a Robustness module, and (M3) an Action Classification module, of which M2 is our main contribution. This module uses a pre-trained 3D pose estimator and pose refinement networks to handle noisy information, including missing points, and uses rotations of the 3D poses to add robustness to camera view-point variation. To evaluate our approach, we carried out comparison experiments between models trained with M2 and without it. These experiments were conducted on the UESTC view-varying dataset, on the i3DPost multi-view human action dataset, and on a Boxing Actions dataset that we created. Our system achieved positive results, improving the accuracy by 24%, 3%, and 11% on the three datasets, respectively. On the UESTC dataset, our method achieves a new state of the art for the cross-view evaluation protocols.
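As a rough illustration of how the three modules could be chained at inference time, the sketch below assumes hypothetical keypoint_net, pose_lifter, pose_refiner, and classifier callables and a simple rotation-and-vote scheme; the paper's actual networks, interfaces, and rotation strategy are not specified in the abstract.

```python
import numpy as np

def rotate_about_vertical(pose_3d: np.ndarray, angle_rad: float) -> np.ndarray:
    """Rotate a (J, 3) skeleton about the vertical (y) axis to simulate a new camera viewpoint."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return pose_3d @ rot.T

def recognize_action(frames, keypoint_net, pose_lifter, pose_refiner, classifier, n_views=4):
    """M1 -> M2 -> M3 pipeline sketch: 2D key-points, robust 3D poses, then action classification.
    keypoint_net, pose_lifter, pose_refiner and classifier are placeholder models."""
    # M1: 2D key-point estimation per RGB frame (may contain noisy or missing joints).
    poses_2d = [keypoint_net(f) for f in frames]                      # each (J, 2)
    # M2: lift to 3D, refine to repair noisy/missing joints, and rotate for view robustness.
    poses_3d = [pose_refiner(pose_lifter(p)) for p in poses_2d]       # each (J, 3)
    votes = []
    for k in range(n_views):
        angle = 2.0 * np.pi * k / n_views
        seq = np.stack([rotate_about_vertical(p, angle) for p in poses_3d])  # (T, J, 3)
        votes.append(classifier(seq))                                 # class-probability vector
    # M3: aggregate predictions over the synthetic viewpoints.
    return np.mean(votes, axis=0).argmax()
```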
Citations: 0
Learned Intrinsic Auto-Calibration From Fundamental Matrices
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00037
Karim Samaha, Georges Younes, Daniel C. Asmar, J. Zelek
Auto-calibration that relies on unconstrained image content and epipolar relationships is necessary in online operations, especially when internal calibration parameters such as focal length can vary. In contrast, traditional calibration relies on a checkerboard and other scene information and is typically conducted offline. Unfortunately, auto-calibration may not always converge when solved traditionally in an iterative optimization formalism. We propose to solve for the intrinsic calibration parameters using a neural network trained on a synthetic Unity dataset that we created. We demonstrate our results on both synthetic and real data to validate the generalizability of our neural network model, which outperforms traditional methods by 2% to 30% and outperforms recent deep learning approaches by a factor of 2 to 4.
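A minimal sketch of the general idea of regressing intrinsics from a fundamental matrix with a small network; the IntrinsicsRegressor below, its layer sizes, its normalization, and its output parameterization (fx, fy, cx, cy) are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class IntrinsicsRegressor(nn.Module):
    """Toy MLP mapping a batch of flattened 3x3 fundamental matrices to intrinsic parameters."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),   # fx, fy, cx, cy (assumed output parameterization)
        )

    def forward(self, F_mat: torch.Tensor) -> torch.Tensor:
        # Normalize so the network is insensitive to the arbitrary scale of F.
        F_flat = F_mat.reshape(F_mat.shape[0], 9)
        F_flat = F_flat / (F_flat.norm(dim=1, keepdim=True) + 1e-8)
        return self.net(F_flat)

# Supervised training on synthetic (F, ground-truth intrinsics) pairs would then use a plain regression loss:
model = IntrinsicsRegressor()
loss_fn = nn.SmoothL1Loss()
pred = model(torch.randn(8, 3, 3))   # (8, 4) predicted intrinsics for a dummy batch
```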
Citations: 0
TemporalNet: Real-time 2D-3D Video Object Detection
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00034
Mei-Huan Chen, J. Lang
Designing a video detection network based on state-of-the-art single-image object detectors may seem like an obvious choice. However, video object detection has extra challenges due to the lower quality of individual frames in a video, and hence the need to include temporal information for high-quality detection results. We design a novel interleaved architecture combining a 2D convolutional network and a 3D temporal network. To exploit inter-frame information, we propose feature aggregation based on a temporal network. Our TemporalNet utilizes Appearance-preserving 3D convolution (AP3D) to extract aligned features in the temporal dimension. Our temporal network functions at multiple scales for better performance, which allows communication between 2D and 3D blocks at each scale and also across scales. Our TemporalNet is a plug-and-play block that can be added to a multi-scale single-image detection network without any adjustments to the network architecture. When TemporalNet is applied to Yolov3, it runs in real time at 35 ms/frame on a low-end GPU. Our real-time approach achieves 77.1% mAP (mean Average Precision) on the ImageNet VID 2017 dataset with TemporalNet-4, while TemporalNet-16 achieves a competitive 80.9% mAP.
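The sketch below illustrates one plausible reading of an interleaved 2D-3D block, with a per-frame Conv2d branch and a cross-frame Conv3d standing in for the AP3D operation; Temporal2D3DBlock, its kernel sizes, and the fusion-by-addition are assumptions for illustration, not the published design.

```python
import torch
import torch.nn as nn

class Temporal2D3DBlock(nn.Module):
    """Simplified interleaved block: a per-frame 2D branch and a cross-frame 3D branch, fused by addition.
    A plain Conv3d stands in for the AP3D operation used in the paper."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, channels, height, width)
        b, t, c, h, w = feats.shape
        per_frame = self.spatial(feats.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        across_frames = self.temporal(feats.permute(0, 2, 1, 3, 4)).permute(0, 2, 1, 3, 4)
        return self.act(per_frame + across_frames)   # fused spatio-temporal features, same shape as input

x = torch.randn(2, 8, 64, 32, 32)      # an 8-frame clip of 64-channel feature maps
y = Temporal2D3DBlock(64)(x)           # temporally aggregated features, shape unchanged
```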
Citations: 1
A Permutation Model for the Self-Supervised Stereo Matching Problem
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00024
Pierre-Andre Brousseau, S. Roy
This paper proposes a novel permutation formulation of the stereo matching problem. Our proposed approach introduces a permutation volume, which provides a natural representation of stereo constraints and disentangles stereo matching from monocular disparity estimation. It also has the benefit of simultaneously computing disparity and a confidence measure, which provides explainability and a simple confidence heuristic for occlusions. In the context of self-supervised learning, the stereo performance is validated on standard testing datasets and the confidence maps are validated through stereo-visibility. Results show that the permutation volume increases stereo performance and features good generalization behaviour. We believe that measuring confidence is a key part of explainability, which is instrumental to the adoption of deep methods in critical stereo applications such as autonomous navigation.
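For intuition only, the following sketch builds a soft permutation over a single rectified scanline and reads off an expected disparity plus a per-pixel confidence from it; the function, its temperature, and its scoring are illustrative assumptions rather than the paper's permutation volume or losses.

```python
import torch
import torch.nn.functional as F

def scanline_soft_permutation(feat_left: torch.Tensor, feat_right: torch.Tensor, temperature: float = 0.07):
    """Didactic permutation-style matching for one pair of rectified epipolar lines.
    feat_left, feat_right: (width, channels) per-pixel feature vectors.
    Returns an expected disparity and a confidence per left-image pixel."""
    sim = feat_left @ feat_right.t() / temperature     # (W, W) similarity between all pixel pairs
    perm = F.softmax(sim, dim=1)                       # soft assignment of each left pixel to a right pixel
    cols = torch.arange(feat_right.shape[0], dtype=perm.dtype)
    rows = torch.arange(feat_left.shape[0], dtype=perm.dtype)
    expected_match = perm @ cols                       # expected matched column per left pixel
    disparity = rows - expected_match                  # signed disparity along the scanline
    confidence = perm.max(dim=1).values                # peaked rows imply confident matches; flat rows suggest occlusion
    return disparity, confidence

d, c = scanline_soft_permutation(torch.randn(128, 32), torch.randn(128, 32))
```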
Citations: 1
Integrating High-Resolution Tactile Sensing into Grasp Stability Prediction
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00021
Lachlan Chumbley, Morris Gu, Rhys Newbury, J. Leitner, Akansel Cosgun
We investigate how high-resolution tactile sensors can be utilized in combination with vision and depth sensing to improve grasp stability prediction. Recent advances in simulating high-resolution tactile sensing, in particular the TACTO simulator, enabled us to evaluate how neural networks can be trained with a combination of sensing modalities. With the large amounts of data needed to train large neural networks, robotic simulators provide a fast way to automate the data collection process. We expand on existing work through an ablation study and an enlarged set of objects taken from the YCB benchmark set. Our results indicate that while the combination of vision, depth, and tactile sensing provides the best prediction results on known objects, the network fails to generalize to unknown objects. Our work also addresses existing issues with robotic grasping in tactile simulation and how to overcome them.
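As a toy illustration of the kind of multimodal grasp-stability classifier being studied, the sketch below fuses vision, depth, and tactile embeddings with a small late-fusion head; GraspStabilityFusion, the feature sizes, and the zero-masking ablation shown afterwards are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class GraspStabilityFusion(nn.Module):
    """Late-fusion classifier over per-modality embeddings (vision, depth, tactile)."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(3 * dim, 256), nn.ReLU(),
            nn.Linear(256, 1),        # logit for "this grasp will be stable"
        )

    def forward(self, rgb_feat, depth_feat, tactile_feat):
        fused = torch.cat([rgb_feat, depth_feat, tactile_feat], dim=-1)
        return self.head(fused)

model = GraspStabilityFusion()
rgb_f, d_f, t_f = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)
stable_logit = model(rgb_f, d_f, t_f)                   # full vision + depth + tactile input
no_tactile = model(rgb_f, d_f, torch.zeros_like(t_f))   # crude ablation: tactile channel masked out
```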
Citations: 0
Anomaly Detection with Adversarially Learned Perturbations of Latent Space
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00031
Vahid Reza Khazaie, A. Wong, John Taylor Jewell, Y. Mohsenzadeh
Anomaly detection aims to identify samples that do not conform to the distribution of the normal data. Because anomalous data are unavailable, training a supervised deep neural network is a cumbersome task; as such, unsupervised methods are a common approach to this problem. Deep autoencoders have been broadly adopted as the basis of many unsupervised anomaly detection methods. However, a notable shortcoming of deep autoencoders is that they provide insufficient representations for anomaly detection because they generalize well enough to reconstruct outliers. In this work, we have designed an adversarial framework consisting of two competing components, an Adversarial Distorter and an Autoencoder. The Adversarial Distorter is a convolutional encoder that learns to produce effective perturbations, and the autoencoder is a deep convolutional neural network that aims to reconstruct the images from the perturbed latent feature space. The networks are trained with opposing goals: the Adversarial Distorter produces perturbations that are applied to the encoder's latent feature space to maximize the reconstruction error, while the autoencoder tries to neutralize the effect of these perturbations to minimize it. When applied to anomaly detection, the proposed method learns semantically richer representations due to the perturbations applied to the feature space. The proposed method outperforms existing state-of-the-art methods for anomaly detection on image and video datasets.
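A compact sketch of the opposing training objectives described above, assuming toy MLP encoder, decoder, and distorter networks on flattened images; the architectures, optimizer settings, and update schedule are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical building blocks; the paper uses convolutional networks instead of these toy MLPs.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 64))
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
distorter = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.Tanh())   # produces a latent perturbation

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(distorter.parameters(), lr=1e-3)

def train_step(x):
    """One round of the opposing objectives on a batch x of normal images."""
    # Distorter step: perturb the latent code so as to MAXIMIZE the reconstruction error.
    z = encoder(x).detach()
    recon = decoder(z + distorter(x))
    loss_d = -F.mse_loss(recon, x.flatten(1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Autoencoder step: reconstruct well despite the (frozen) perturbation, i.e. MINIMIZE the error.
    z = encoder(x)
    recon = decoder(z + distorter(x).detach())
    loss_ae = F.mse_loss(recon, x.flatten(1))
    opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()
    return loss_ae.item()

# At test time the reconstruction error itself would serve as the anomaly score.
```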
Citations: 3
Inter- & Intra-City Image Geolocalization
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00023
J. Tanner, K. Dick, J.R. Green
Can a photo be accurately geolocated within a city from its pixels alone? While this image geolocation problem has been successfully addressed at the planetary and nation levels when framed as a classification problem using convolutional neural networks, no method has yet been able to precisely geolocate images within a city and/or at the street level when framed as a latitude/longitude regression-type problem. We leverage the densely sampled Streetlearn dataset of imagery from Manhattan and Pittsburgh to first develop a highly accurate inter-city predictor and then experimentally resolve, for the first time, the intra-city performance limits of framing image geolocation as a regression-type problem. We then reformulate the problem as an extreme-resolution classification task by subdividing the city into hundreds of equirectangular-scaled bins and train our respective intra-city deep convolutional neural networks on tens of thousands of images. Our experiments serve as a foundation for developing a scalable inter- and intra-city image geolocation framework that, on average, resolves an image to within 250 m². We demonstrate that our models outperform SIFT-based image retrieval-type models under differing weather patterns, lighting conditions, and location-specific imagery, and are temporally robust when evaluated on both past and future imagery. The practical and ethical ramifications of such a model are also discussed, given the threat to individual privacy in a technocentric surveillance capitalist society.
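The reformulation from regression to classification amounts to assigning each coordinate to a grid cell and predicting the cell index. The helper below shows one way to do that over an equirectangular grid; the bounding box and grid resolution are chosen purely for illustration and are not the paper's values.

```python
import numpy as np

def latlon_to_bin(lat, lon, bounds, n_rows, n_cols):
    """Map a coordinate to the index of an equirectangular grid cell covering a city.
    bounds = (lat_min, lat_max, lon_min, lon_max)."""
    lat_min, lat_max, lon_min, lon_max = bounds
    row = int(np.clip((lat - lat_min) / (lat_max - lat_min) * n_rows, 0, n_rows - 1))
    col = int(np.clip((lon - lon_min) / (lon_max - lon_min) * n_cols, 0, n_cols - 1))
    return row * n_cols + col                 # class label for the geolocation classifier

def bin_to_latlon(idx, bounds, n_rows, n_cols):
    """Recover the cell-centre coordinate used when reporting a prediction."""
    lat_min, lat_max, lon_min, lon_max = bounds
    row, col = divmod(idx, n_cols)
    lat = lat_min + (row + 0.5) / n_rows * (lat_max - lat_min)
    lon = lon_min + (col + 0.5) / n_cols * (lon_max - lon_min)
    return lat, lon

# Example: a Manhattan-like bounding box split into a coarse grid (illustrative numbers only).
manhattan = (40.70, 40.88, -74.02, -73.91)
label = latlon_to_bin(40.758, -73.985, manhattan, n_rows=30, n_cols=20)
center = bin_to_latlon(label, manhattan, n_rows=30, n_cols=20)
```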
Citations: 0
Object Class Aware Video Anomaly Detection through Image Translation
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00020
M. Baradaran, R. Bergevin
Semi-supervised video anomaly detection (VAD) methods formulate the task of anomaly detection as detection of deviations from the learned normal patterns. Previous works in the field (reconstruction- or prediction-based methods) suffer from two drawbacks: 1) they focus on low-level features and, especially in the case of holistic approaches, do not effectively consider object classes; 2) object-centric approaches neglect some of the context information (such as location). To tackle these challenges, this paper proposes a novel two-stream object-aware VAD method that learns the normal appearance and motion patterns through image translation tasks. The appearance branch translates the input image to the target semantic segmentation map produced by Mask-RCNN, and the motion branch associates each frame with its expected optical flow magnitude. Any deviation from the expected appearance or motion at inference time indicates the degree of potential abnormality. We evaluated our proposed method on the ShanghaiTech, UCSD-Ped1, and UCSD-Ped2 datasets, and the results show competitive performance compared with state-of-the-art works. Most importantly, the results show that, as a significant improvement over previous methods, detections by our method are completely explainable and anomalies are localized accurately in the frames.
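At inference time, the per-pixel deviation of each branch from its target can be combined into an anomaly map and a frame-level score. The sketch below assumes placeholder appearance_net and motion_net callables and a one-hot-encoded segmentation target; it is an illustration of the scoring idea, not the paper's exact procedure.

```python
import numpy as np

def frame_anomaly_score(frame, seg_target, flow_mag_target, appearance_net, motion_net, w=0.5):
    """Illustrative inference step for a two-stream, translation-based VAD model.
    seg_target: one-hot-encoded Mask-RCNN semantic map, shape (H, W, C).
    flow_mag_target: optical-flow magnitude, shape (H, W).
    appearance_net and motion_net are placeholder callables for the two branches."""
    pred_seg = appearance_net(frame)                         # predicted (H, W, C) semantic map
    pred_flow = motion_net(frame)                            # predicted (H, W) flow magnitude
    err_app = ((pred_seg - seg_target) ** 2).sum(axis=-1)    # per-pixel appearance deviation
    err_mot = (pred_flow - flow_mag_target) ** 2             # per-pixel motion deviation
    anomaly_map = w * err_app + (1.0 - w) * err_mot          # localizes anomalies within the frame
    return anomaly_map, float(anomaly_map.max())             # frame score = strongest local deviation
```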
Citations: 2