
Latest publications from the 2021 IEEE/CVF International Conference on Computer Vision (ICCV)

GeomNet: A Neural Network Based on Riemannian Geometries of SPD Matrix Space and Cholesky Space for 3D Skeleton-Based Interaction Recognition
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.01313 | Pages: 13359-13369
X. Nguyen
In this paper, we propose a novel method for representation and classification of two-person interactions from 3D skeleton sequences. The key idea of our approach is to use Gaussian distributions to capture statistics on ℝⁿ and those on the space of symmetric positive definite (SPD) matrices. The main challenge is how to parametrize those distributions. Towards this end, we develop methods for embedding Gaussian distributions in matrix groups based on the theory of Lie groups and Riemannian symmetric spaces. Our method relies on the Riemannian geometry of the underlying manifolds and has the advantage of encoding high-order statistics from 3D joint positions. We show that the proposed method achieves competitive results in two-person interaction recognition on three benchmarks for 3D human activity understanding.
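As a rough illustration of how a Gaussian on ℝⁿ can be treated as a point on an SPD manifold, the NumPy sketch below uses the classical embedding of N(μ, Σ) into the (n+1)×(n+1) SPD cone and compares two embedded Gaussians with a log-Cholesky-style distance. The embedding, the distance, and the toy joint data are illustrative choices, not the paper's exact parametrization.

```python
# Illustrative sketch (not the paper's code): embed a Gaussian N(mu, Sigma) on R^n
# as an (n+1)x(n+1) SPD matrix and compare two embeddings with a Cholesky-based
# distance. The embedding is the classical [[Sigma + mu mu^T, mu], [mu^T, 1]]
# construction; the distance is one common log-Cholesky choice.
import numpy as np

def gaussian_to_spd(mu, sigma):
    """Embed N(mu, sigma) into the SPD cone: [[sigma + mu mu^T, mu], [mu^T, 1]]."""
    n = mu.shape[0]
    m = np.empty((n + 1, n + 1))
    m[:n, :n] = sigma + np.outer(mu, mu)
    m[:n, n] = mu
    m[n, :n] = mu
    m[n, n] = 1.0
    return m

def log_cholesky_dist(p, q):
    """Distance in Cholesky space: compare strictly-lower parts and log-diagonals."""
    lp, lq = np.linalg.cholesky(p), np.linalg.cholesky(q)
    lower = np.tril(lp, -1) - np.tril(lq, -1)
    diag = np.log(np.diag(lp)) - np.log(np.diag(lq))
    return np.sqrt((lower ** 2).sum() + (diag ** 2).sum())

rng = np.random.default_rng(0)
joints_a = rng.normal(size=(50, 3))           # toy "3D joint" samples, person A
joints_b = rng.normal(loc=0.3, size=(50, 3))  # person B, slightly shifted
spd_a = gaussian_to_spd(joints_a.mean(0), np.cov(joints_a.T))
spd_b = gaussian_to_spd(joints_b.mean(0), np.cov(joints_b.T))
print(log_cholesky_dist(spd_a, spd_b))
```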
Citations: 19
High Quality Disparity Remapping with Two-Stage Warping
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.00227 | Pages: 2249-2258
Bing Li, Chia-Wen Lin, Cheng Zheng, Sha Liu, Junsong Yuan, Bernard Ghanem, C. J. Kuo
A high quality disparity remapping method that preserves 2D shapes and 3D structures, and adjusts disparities of important objects in stereo image pairs is proposed. It is formulated as a constrained optimization problem, whose solution is challenging, since we need to meet multiple requirements of disparity remapping simultaneously. The one-stage optimization process either degrades the quality of important objects or introduces serious distortions in background regions. To address this challenge, we propose a two-stage warping process to solve it. In the first stage, we develop a warping model that finds the optimal warping grids for important objects to fulfill multiple requirements of disparity remapping. In the second stage, we derive another warping model to refine warping results in less important regions by eliminating serious distortions in shape, disparity and 3D structure. The superior performance of the proposed method is demonstrated by experimental results.
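The formulation above is an optimization over warping grids. The toy PyTorch sketch below conveys the flavor of such a grid-based energy, trading a disparity target on a salient region against a smoothness (shape-preservation) term; the grid size, mask, weights, and loss terms are invented for illustration and do not reproduce the paper's two-stage solver.

```python
# Toy sketch of grid-based warping (not the paper's two-stage method): optimize a
# coarse horizontal-shift grid so that an "important" region reaches a target
# disparity while a smoothness term keeps cell shapes close to the original grid.
import torch

H, W = 8, 12                                   # coarse warping grid, one shift per cell
shift = torch.zeros(H, W, requires_grad=True)  # horizontal warp offsets (pixels)
important = torch.zeros(H, W); important[2:6, 3:9] = 1.0  # mask of the salient object
target_disp = 5.0                              # desired disparity change for that object

opt = torch.optim.Adam([shift], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    # disparity term: salient cells should be shifted by target_disp
    e_disp = (important * (shift - target_disp) ** 2).mean()
    # shape term: neighboring cells should move together (protects 2D shape / 3D structure)
    e_shape = (shift[:, 1:] - shift[:, :-1]).pow(2).mean() + \
              (shift[1:, :] - shift[:-1, :]).pow(2).mean()
    loss = e_disp + 0.5 * e_shape
    loss.backward()
    opt.step()

print(shift.detach()[4, 6], shift.detach()[0, 0])  # salient cell vs. background cell
```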
Citations: 1
SOMA: Solving Optical Marker-Based MoCap Automatically
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.01093 | Pages: 11097-11106
N. Ghorbani, Michael J. Black
Marker-based optical motion capture (mocap) is the "gold standard" method for acquiring accurate 3D human motion in computer vision, medicine, and graphics. The raw output of these systems is noisy and incomplete 3D points or short tracklets of points. To be useful, one must associate these points with corresponding markers on the captured subject; i.e. "labelling". Given these labels, one can then "solve" for the 3D skeleton or body surface mesh. Commercial auto-labeling tools require a specific calibration procedure at capture time, which is not possible for archival data. Here we train a novel neural network called SOMA, which takes raw mocap point clouds with varying numbers of points, labels them at scale without any calibration data, is independent of the capture technology, and requires only minimal human intervention. Our key insight is that, while labeling point clouds is highly ambiguous, the 3D body provides strong constraints on the solution that can be exploited by a learning-based method. To enable learning, we generate massive training sets of simulated noisy and ground truth mocap markers animated by 3D bodies from AMASS. SOMA exploits an architecture with stacked self-attention elements to learn the spatial structure of the 3D body and an optimal transport layer to constrain the assignment (labeling) problem while rejecting outliers. We extensively evaluate SOMA both quantitatively and qualitatively. SOMA is more accurate and robust than existing state-of-the-art research methods and can be applied where commercial systems cannot. We automatically label over 8 hours of archival mocap data across 4 different datasets captured using various technologies and output SMPL-X body models. The model and data are released for research purposes at https://soma.is.tue.mpg.de/.
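The optimal-transport layer mentioned above can be pictured as a few Sinkhorn-style normalization passes over a point-to-label score matrix with an extra "dustbin" column for outliers. The sketch below is a toy version of that idea (random scores, simplified normalization), not SOMA's actual layer.

```python
# Minimal sketch of an optimal-transport assignment layer (illustration only):
# Sinkhorn-style iterations turn a point-to-label score matrix into a soft
# assignment, with an extra "dustbin" label so ghost points / outliers can be rejected.
import torch

def sinkhorn_assign(scores, n_iters=50):
    """scores: (num_points, num_labels + 1) log-affinities; last column = dustbin."""
    log_p = scores.clone()
    for _ in range(n_iters):
        # normalize rows so each captured point distributes unit mass over labels
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        # normalize the real-label columns so each marker label receives ~unit mass;
        # the dustbin column is deliberately left unconstrained here
        log_p[:, :-1] = log_p[:, :-1] - torch.logsumexp(log_p[:, :-1], dim=0, keepdim=True)
    return log_p.exp()

torch.manual_seed(0)
scores = torch.randn(6, 5)      # 6 captured points, 4 marker labels + 1 dustbin
assign = sinkhorn_assign(scores)
print(assign.argmax(dim=1))     # index 4 means "rejected as outlier"
```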
Citations: 12
HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.00172 | Pages: 1678-1687
Fei Liu, Jing Liu, Weining Wang, Hanqing Lu
Relational reasoning is at the heart of video question answering. However, existing approaches suffer from several common limitations: (1) they only focus on either object-level or frame-level relational reasoning, and fail to integrate the two; and (2) they neglect to leverage semantic knowledge for relational reasoning. In this work, we propose a Hierarchical VisuAl-Semantic RelatIonal Reasoning (HAIR) framework to address these limitations. Specifically, we present a novel graph memory mechanism to perform relational reasoning, and further develop two types of graph memory: a) visual graph memory that leverages visual information of the video for relational reasoning; b) semantic graph memory that is specifically designed to explicitly leverage the semantic knowledge contained in the classes and attributes of video objects, and to perform relational reasoning in the semantic space. Taking advantage of both graph memory mechanisms, we build a hierarchical framework to enable visual-semantic relational reasoning from the object level to the frame level. Experiments on four challenging benchmark datasets show that the proposed framework leads to state-of-the-art performance, with fewer parameters and faster inference speed. Besides, our approach also shows superior performance on other video+language tasks.
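A single step of graph-based relational reasoning of the kind described above can be sketched as attention-weighted message passing over a node graph. The minimal PyTorch example below is a generic attention step with made-up weights; it stands in for, rather than reproduces, the paper's visual and semantic graph memories.

```python
# Illustrative single step of relational reasoning over a graph of nodes (not the
# paper's exact graph memory): each node attends to its neighbors and the aggregated
# message is fused back into the node features with a residual update.
import torch
import torch.nn.functional as F

def graph_attention_step(nodes, adj, w_q, w_k, w_v):
    """nodes: (N, d) features; adj: (N, N) 0/1 adjacency (which nodes may interact)."""
    q, k, v = nodes @ w_q, nodes @ w_k, nodes @ w_v
    att = (q @ k.T) / k.shape[1] ** 0.5
    att = att.masked_fill(adj == 0, float("-inf"))  # only reason over connected nodes
    att = F.softmax(att, dim=1)
    return nodes + att @ v                          # residual update of node states

torch.manual_seed(0)
d = 16
objects = torch.randn(5, d)    # e.g., object-level nodes of one frame
adj = torch.ones(5, 5)         # fully connected toy graph
w_q, w_k, w_v = (torch.randn(d, d) * 0.1 for _ in range(3))
print(graph_attention_step(objects, adj, w_q, w_k, w_v).shape)
```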
Citations: 29
Unsupervised Non-Rigid Image Distortion Removal via Grid Deformation
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.00252 | Pages: 2502-2512
Nianyi Li, Simron Thapa, Cameron Whyte, Albert W. Reed, Suren Jayasuriya, Jinwei Ye
Many computer vision problems face difficulties when imaging through turbulent refractive media (e.g., air and water) due to the refraction and scattering of light. These effects cause geometric distortion that requires either handcrafted physical priors or supervised learning methods to remove. In this paper, we present a novel unsupervised network to recover the latent distortion-free image. The key idea is to model non-rigid distortions as deformable grids. Our network consists of a grid deformer that estimates the distortion field and an image generator that outputs the distortion-free image. By leveraging the positional encoding operator, we can simplify the network structure while maintaining fine spatial details in the recovered images. Our method doesn't need to be trained on labeled data and has good transferability across various turbulent image datasets with different types of distortions. Extensive experiments on both simulated and real-captured turbulent images demonstrate that our method can remove both air and water distortions without much customization.
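Two of the ingredients named above, a positional encoding of grid coordinates and resampling an image through a deformation grid, can be sketched directly. In the toy example below the "predicted" deformation is a hand-written sinusoid and the frequencies and resolution are arbitrary, so it only illustrates the mechanics of grid deformation, not the trained grid deformer or image generator.

```python
# Toy sketch of grid deformation (illustrative, not the authors' network): a
# sinusoidal positional encoding of normalized grid coordinates, and bilinear
# resampling of an image through a displaced sampling grid (grid_sample).
import math
import torch
import torch.nn.functional as F

def positional_encoding(x, n_freqs=4):
    """x: (..., 2) coords in [-1, 1] -> (..., 4 * n_freqs) Fourier features."""
    feats = []
    for i in range(n_freqs):
        feats += [torch.sin(2 ** i * math.pi * x), torch.cos(2 ** i * math.pi * x)]
    return torch.cat(feats, dim=-1)

h, w = 32, 32
ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
base_grid = torch.stack([xs, ys], dim=-1)             # (h, w, 2) identity sampling grid
enc = positional_encoding(base_grid)                  # what a grid-deformer MLP would consume
offsets = 0.02 * torch.sin(6 * math.pi * base_grid)   # stand-in for a predicted deformation
image = torch.rand(1, 3, h, w)                        # distorted input frame
undistorted = F.grid_sample(image, (base_grid + offsets).unsqueeze(0), align_corners=True)
print(enc.shape, undistorted.shape)
```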
Citations: 11
Towards A Universal Model for Cross-Dataset Crowd Counting
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.00319 | Pages: 3185-3194
Zhiheng Ma, Xiaopeng Hong, Xing Wei, Yunfeng Qiu, Yihong Gong
This paper addresses the practical problem of learning a universal model for crowd counting across scenes and datasets. We find that the crux of this problem is the catastrophic sensitivity of crowd counters to scale shift, which is very common in the real world and is caused by factors such as different scene layouts and image resolutions. It is therefore difficult to train a universal model that can be applied to various scenes. To address this problem, we propose scale alignment as a prime module for establishing a novel crowd counting framework. We derive a closed-form solution for the optimal image rescaling factors for alignment by minimizing the distances between scale distributions. A novel neural network, together with a loss function based on an efficient sliced Wasserstein distance, is also proposed for scale distribution estimation. Benefiting from the proposed method, we learn a universal model that works well across several datasets and can even significantly outperform state-of-the-art models that are specifically fine-tuned for each dataset. Experiments also demonstrate the much better generalizability of our model to unseen scenes.
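The sliced Wasserstein distance used as a loss above has a compact form: project both sample sets onto random directions, sort the projections, and average the resulting 1-D transport costs. The NumPy sketch below shows that computation on made-up "scale" samples; the projection count and data are arbitrary.

```python
# Minimal sketch of a sliced Wasserstein distance between two point sets (the loss
# family mentioned above); projections and toy data are purely illustrative.
import numpy as np

def sliced_wasserstein(x, y, n_proj=64, rng=None):
    """x, y: (n, d) samples with equal n; average 1-D Wasserstein over random slices."""
    rng = rng or np.random.default_rng(0)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        # 1-D optimal transport between sorted projections (equal sample sizes assumed)
        total += np.abs(px - py).mean()
    return total / n_proj

rng = np.random.default_rng(1)
pred_scales = rng.normal(loc=1.0, scale=0.2, size=(256, 2))  # e.g., predicted (h, w) scales
ref_scales = rng.normal(loc=1.5, scale=0.2, size=(256, 2))   # reference scale distribution
print(sliced_wasserstein(pred_scales, ref_scales))
```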
Citations: 30
A Machine Teaching Framework for Scalable Recognition
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.00490 | Pages: 4925-4934
Pei Wang, N. Vasconcelos
We consider the scalable recognition problem in the fine-grained expert domain where large-scale data collection is easy whereas annotation is difficult. Existing solutions are typically based on semi-supervised or self-supervised learning. We propose an alternative new framework, MEMORABLE, based on machine teaching and online crowd-sourcing platforms. A small amount of data is first labeled by experts and then used to teach online annotators for the classes of interest, who finally label the entire dataset. Preliminary studies show that the accuracy of classifiers trained on the final dataset is a function of the accuracy of the student annotators. A new machine teaching algorithm, CMaxGrad, is then proposed to enhance this accuracy by introducing explanations in a state-of-the-art machine teaching algorithm. For this, CMaxGrad leverages counterfactual explanations, which take into account student predictions, thereby providing feedback that is student-specific, explicitly addresses the causes of student confusion, and adapts to the level of competence of the student. Experiments show that both MEMORABLE and CMaxGrad outperform existing solutions to their respective problems.
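As a stand-in for the teaching-set selection idea (not CMaxGrad itself), the sketch below scores expert-labeled examples by a preliminary linear "student" model's loss and picks the hardest ones as the candidates most worth explaining to annotators; the model, features, and selection rule are all illustrative assumptions.

```python
# Very rough sketch of teaching-set selection (a stand-in heuristic, not CMaxGrad):
# given a small expert-labeled pool and a preliminary "student" classifier, pick the
# examples the student gets most wrong, since those are where explanations help most.
import numpy as np

def select_teaching_set(features, labels, weights, k=3):
    """Linear-softmax student; return indices of the k highest-loss labeled examples."""
    logits = features @ weights
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    losses = -np.log(probs[np.arange(len(labels)), labels] + 1e-9)
    return np.argsort(losses)[-k:]

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 8))           # expert-labeled pool (toy features)
labels = rng.integers(0, 4, size=20)       # 4 fine-grained classes
student_w = rng.normal(size=(8, 4)) * 0.1  # preliminary student classifier
print(select_teaching_set(feats, labels, student_w))
```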
Citations: 8
Robust Automatic Monocular Vehicle Speed Estimation for Traffic Surveillance
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.00451 | Pages: 4531-4541
Jérôme Revaud, M. Humenberger
Even though CCTV cameras are widely deployed for traffic surveillance and therefore have the potential to become cheap automated sensors for traffic speed analysis, their large-scale usage toward this goal has not been reported yet. A key difficulty in fact lies in the camera calibration phase. Existing state-of-the-art methods perform the calibration using image processing or keypoint detection techniques that require high-quality video streams, yet typical CCTV footage is low-resolution and noisy. As a result, these methods largely fail in real-world conditions. In contrast, we propose two novel calibration techniques whose only inputs come from an off-the-shelf object detector. Both methods consider multiple detections jointly, leveraging the fact that cars have similar and well-known 3D shapes with normalized dimensions. The first is based on minimizing an energy function corresponding to a 3D reprojection error, while the second learns from synthetic training data to predict the scene geometry directly. Noticing the lack of speed estimation benchmarks that faithfully reflect the actual quality of surveillance cameras, we introduce a novel dataset collected from public CCTV streams. Experimental results on three diverse benchmarks demonstrate excellent speed estimation accuracy that could enable the wide use of CCTV cameras for traffic analysis, even in challenging conditions where state-of-the-art methods completely fail. Additional information can be found on our project web page: https://rebrand.ly/nle-cctv
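Once calibration is available, the speed computation itself is simple: map the tracked pixel positions to the ground plane and divide the travelled distance by the elapsed time. The sketch below does this with a made-up ground-plane homography, frame rate, and track, so the numbers are purely illustrative.

```python
# Small sketch of the final speed computation once a camera is calibrated
# (illustration only): a ground-plane homography maps pixel positions to metric
# coordinates, and speed is the distance travelled divided by the elapsed time.
import numpy as np

def pixels_to_ground(h_mat, pts):
    """Apply a 3x3 homography to (n, 2) pixel points, returning (n, 2) ground-plane meters."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ h_mat.T
    return mapped[:, :2] / mapped[:, 2:3]

H = np.array([[0.05, 0.0, -20.0],   # toy calibration: roughly 0.05 m per pixel in x
              [0.0, 0.08, -15.0],   # and 0.08 m per pixel in y
              [0.0, 0.0, 1.0]])
track_px = np.array([[400.0, 300.0], [402.0, 306.0], [404.0, 312.0]])  # per-frame detections
fps = 25.0

ground = pixels_to_ground(H, track_px)
dist = np.linalg.norm(np.diff(ground, axis=0), axis=1).sum()  # meters along the track
speed_kmh = dist / (len(track_px) - 1) * fps * 3.6
print(round(speed_kmh, 1))
```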
Citations: 7
A Unified 3D Human Motion Synthesis Model via Conditional Variational Auto-Encoder
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.01144 | Pages: 11625-11635
Yujun Cai, Yiwei Wang, Yiheng Zhu, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Chuanxia Zheng, Sijie Yan, Henghui Ding, Xiaohui Shen, Ding Liu, N. Thalmann
We present a unified and flexible framework to address the generalized problem of 3D motion synthesis that covers the tasks of motion prediction, completion, interpolation, and spatial-temporal recovery. Since these tasks have different input constraints and various fidelity and diversity requirements, most existing approaches only cater to a specific task or use different architectures to address various tasks. Here we propose a unified framework based on Conditional Variational Auto-Encoder (CVAE), where we treat any arbitrary input as a masked motion series. Notably, by considering this problem as a conditional generation process, we estimate a parametric distribution of the missing regions based on the input conditions, from which to sample and synthesize the full motion series. To further allow the flexibility of manipulating the motion style of the generated series, we design an Action-Adaptive Modulation (AAM) to propagate the given semantic guidance through the whole sequence. We also introduce a cross-attention mechanism to exploit distant relations among decoder and encoder features for better realism and global consistency. We conducted extensive experiments on Human 3.6M and CMU-Mocap. The results show that our method produces coherent and realistic results for various motion synthesis tasks, with the synthesized motions distinctly adapted by the given action labels.
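The conditioning scheme described above, treating any input as a masked motion series, can be illustrated with a bare-bones conditional VAE: the masked sequence is fed to both encoder and decoder, and the decoder reconstructs the full sequence from a sampled latent. The toy PyTorch model below is such a sketch (MLP encoder and decoder, arbitrary sizes), not the paper's architecture.

```python
# Bare-bones CVAE forward pass for masked motion completion (an illustration of the
# conditioning idea above, not the paper's network): the masked sequence is the
# condition, and the decoder reconstructs the full sequence from a sampled latent.
import torch
import torch.nn as nn

class TinyMotionCVAE(nn.Module):
    def __init__(self, seq_dim, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(seq_dim * 2, 64), nn.ReLU(), nn.Linear(64, latent_dim * 2))
        self.dec = nn.Sequential(nn.Linear(latent_dim + seq_dim, 64), nn.ReLU(), nn.Linear(64, seq_dim))

    def forward(self, full_seq, masked_seq):
        stats = self.enc(torch.cat([full_seq, masked_seq], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.dec(torch.cat([z, masked_seq], dim=-1))
        return recon, mu, logvar

T, J = 30, 17                    # frames x joints, x/y/z flattened below
seq_dim = T * J * 3
full = torch.randn(4, seq_dim)   # ground-truth motion (batch of 4)
mask = (torch.rand(4, seq_dim) > 0.5).float()
recon, mu, logvar = TinyMotionCVAE(seq_dim)(full, full * mask)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
loss = (recon - full).pow(2).mean() + 0.1 * kl
print(loss.item())
```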
Citations: 44
Query Adaptive Few-Shot Object Detection with Heterogeneous Graph Convolutional Networks
Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.00325 | Pages: 3243-3252
G. Han, Yicheng He, Shiyuan Huang, Jiawei Ma, Shih-Fu Chang
Few-shot object detection (FSOD) aims to detect never-seen objects using few examples. The field has recently improved owing to meta-learning techniques that learn how to match the query image with few-shot class examples, so that the learned model can generalize to few-shot novel classes. However, most current meta-learning-based methods perform pairwise matching between query image regions (usually proposals) and novel classes separately, and therefore fail to take into account the multiple relationships among them. In this paper, we propose a novel FSOD model using heterogeneous graph convolutional networks. Through efficient message passing among all the proposal and class nodes over three different types of edges, we obtain context-aware proposal features and query-adaptive, multiclass-enhanced prototype representations for each class, which help promote pairwise matching and improve the final FSOD accuracy. Extensive experimental results show that our proposed model, denoted QA-FewDet, outperforms current state-of-the-art approaches on the PASCAL VOC and MSCOCO FSOD benchmarks under different shots and evaluation metrics.
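One round of the heterogeneous message passing described above can be sketched with a separate weight matrix per edge type (class to proposal, proposal to class, class to class) and similarity-based soft edges. The NumPy example below is a toy illustration with random features, not the QA-FewDet network.

```python
# Sketch of one round of heterogeneous message passing (illustrative): proposal and
# class nodes exchange messages, with a separate weight matrix per edge type, so
# class prototypes become query-adaptive and proposals become class-aware.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def soft_edges(src, dst):
    """Row-normalized similarity used as soft edge weights from src nodes into dst nodes."""
    sim = dst @ src.T
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    return sim / sim.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
d = 32
proposals = rng.normal(size=(10, d))  # query-image proposal features
classes = rng.normal(size=(5, d))     # few-shot class prototype features

# one weight matrix per edge type: class->proposal, proposal->class, class->class
w_cp, w_pc, w_cc = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))

proposals_new = relu(proposals + soft_edges(classes, proposals) @ classes @ w_cp)
classes_new = relu(classes + soft_edges(proposals, classes) @ proposals @ w_pc
                           + soft_edges(classes, classes) @ classes @ w_cc)
print(proposals_new.shape, classes_new.shape)
```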
Citations: 60