Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision: latest papers, page 8

HVC-Net: Unifying Homography, Visibility, and Confidence Learning for Planar Object Tracking

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-19 DOI: 10.48550/arXiv.2209.08924

Haoxian Zhang, Yonggen Ling

Robust and accurate planar tracking over a whole video sequence is vitally important for many vision applications. The key to planar object tracking is to find object correspondences, modeled by a homography, between the reference image and the tracked image. Existing methods tend to produce wrong correspondences under appearance variations, camera-object relative motion, and occlusions. To alleviate this problem, we present a unified convolutional neural network (CNN) model that jointly considers homography, visibility, and confidence. First, we introduce correlation blocks that explicitly account for local appearance changes and camera-object relative motion as the base of our model. Second, we jointly learn homography and visibility, which links camera-object relative motion with occlusions. Third, we propose a confidence module that actively monitors the estimation quality from the pixel correlation distributions obtained in the correlation blocks. All these modules are plugged into a Lucas-Kanade (LK) tracking pipeline to obtain both accurate and robust planar object tracking. Our approach outperforms the state-of-the-art methods on the public POT and TMT datasets. Its superior performance is also verified on a real-world application, synthesizing high-quality in-video advertisements.
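
To make the phrase "correspondences, modeled by homography" concrete, here is a minimal sketch of how a 3x3 homography maps points of the reference image into the tracked image. It is not the HVC-Net model; the function name `apply_homography` and the example matrices are assumptions chosen only for illustration.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H given as nested row-major lists."""
    x, y = point
    # Lift (x, y) to homogeneous coordinates (x, y, 1) and multiply by H.
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by w to return to ordinary image coordinates.
    return (xh / w, yh / w)


# Hypothetical example: a pure translation by (5, -2) applied to the
# four corners of a planar object in the reference image.
H_shift = [[1.0, 0.0, 5.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
corners = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
tracked = [apply_homography(H_shift, c) for c in corners]

# Hypothetical example: a non-trivial last row rescales the point.
H_scale = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
halved = apply_homography(H_scale, (4.0, 6.0))
```

A tracker estimates `H` for every frame; the visibility and confidence outputs described in the abstract would then indicate which of the mapped points can be trusted.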

Volume: 26 1, Pages: 701-718

Citations: 3

D&D: Learning Human Dynamics from Dynamic Camera

Pub Date : 2022-09-19 DOI: 10.48550/arXiv.2209.08790

Jiefeng Li, Siyuan Bian, Chaoshun Xu, Gang Liu, Gang Yu, Cewu Lu

3D human pose estimation from a monocular video has recently seen significant improvements. However, most state-of-the-art methods are kinematics-based and are prone to physically implausible motions with pronounced artifacts. Current dynamics-based methods can predict physically plausible motion but are restricted to simple scenarios with a static camera view. In this work, we present D&D (Learning Human Dynamics from Dynamic Camera), which leverages the laws of physics to reconstruct 3D human motion from in-the-wild videos with a moving camera. D&D introduces inertial force control (IFC) to explain the 3D human motion in the non-inertial local frame by considering the inertial forces of the dynamic camera. To learn the ground contact with limited annotations, we develop probabilistic contact torque (PCT), which is computed by differentiable sampling from contact probabilities and used to generate motions. The contact state can be weakly supervised by encouraging the model to generate correct motions. Furthermore, we propose an attentive PD controller that adjusts target pose states using temporal information to obtain smooth and accurate pose control. Our approach is entirely neural-based and runs without offline optimization or simulation in physics engines. Experiments on large-scale 3D human motion benchmarks demonstrate the effectiveness of D&D, where we exhibit superior performance against both state-of-the-art kinematics-based and dynamics-based methods. Code is available at https://github.com/Jeffsjtu/DnD
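
For readers unfamiliar with PD control, the sketch below shows a plain proportional-derivative controller driving a single one-dimensional joint toward a target. It is only the textbook building block, not the attentive PD controller of D&D, which additionally adjusts the target states from temporal information. The gains, time step, and unit inertia are assumptions.

```python
def pd_torque(target, position, velocity, kp, kd):
    """Textbook PD law: pull toward the target, damp the current velocity."""
    return kp * (target - position) - kd * velocity


def simulate_joint(target, kp, kd, steps=1000, dt=0.01, inertia=1.0):
    """Integrate one joint with semi-implicit Euler, starting at rest at 0."""
    position, velocity = 0.0, 0.0
    for _ in range(steps):
        torque = pd_torque(target, position, velocity, kp, kd)
        velocity += dt * torque / inertia  # update velocity first
        position += dt * velocity          # then position with the new velocity
    return position


# Assumed gains: kd = 2 * sqrt(kp * inertia) gives critical damping.
final_position = simulate_joint(target=1.0, kp=100.0, kd=20.0)
```

With these gains the joint settles on the target without oscillating, which is the "smooth and accurate pose control" behaviour a PD controller is meant to provide.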

Volume: 19 1, Pages: 479-496

Citations: 17

Real-time Online Video Detection with Temporal Smoothing Transformers

Pub Date : 2022-09-19 DOI: 10.48550/arXiv.2209.09236

Yue Zhao, Philipp Krahenbuhl

Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernels and apply two kinds of temporal smoothing kernel: a box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame and requires only a constant-time update for each frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, which takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6\times$ faster than equivalent sliding-window based transformers with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchen-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all but one prior approach on the THUMOS'14 dataset.
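
The constant-time update can be illustrated with a small sketch. It assumes a single fixed query vector and an exponential (Laplace) decay over frame distance; under those assumptions the attention numerator and denominator can be carried from frame to frame instead of being recomputed over the whole history. The class and function names are hypothetical and this is not the TeSTra implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class StreamingLaplaceAttention:
    """Kernel attention with weight exp(-decay * age) * exp(query . key)."""

    def __init__(self, query, decay, value_dim):
        self.query = query
        self.gamma = math.exp(-decay)   # per-frame decay factor
        self.num = [0.0] * value_dim    # running weighted sum of values
        self.den = 0.0                  # running sum of weights

    def update(self, key, value):
        # O(1) per frame: decay the cached sums, then add the new frame.
        w = math.exp(dot(self.query, key))
        self.num = [self.gamma * n + w * v for n, v in zip(self.num, value)]
        self.den = self.gamma * self.den + w
        return [n / self.den for n in self.num]


def full_recompute(query, decay, keys, values):
    """Reference: recompute attention over the entire history (O(t) per frame)."""
    t = len(keys) - 1
    weights = [math.exp(-decay * (t - i)) * math.exp(dot(query, k))
               for i, k in enumerate(keys)]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


# Hypothetical toy stream of four frames.
keys = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.8], [0.0, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
query = [0.7, -0.1]

attn = StreamingLaplaceAttention(query, decay=0.5, value_dim=2)
for k, v in zip(keys, values):
    streamed = attn.update(k, v)
reference = full_recompute(query, 0.5, keys, values)
```

The streamed result agrees with the full recomputation up to floating-point rounding, while the per-frame cost and cache size stay constant regardless of how long the stream is.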

Volume: 103 1, Pages: 485-502

Citations: 12

RVSL: Robust Vehicle Similarity Learning in Real Hazy Scenes Based on Semi-supervised Learning

Pub Date : 2022-09-18 DOI: 10.48550/arXiv.2209.08630

Wei-Ting Chen, I-Hsiang Chen, C. Yeh, Han Yang, Hua-En Chang, Jianwei Ding, Sy-Yen Kuo

Recently, vehicle similarity learning, also called re-identification (ReID), has attracted significant attention in computer vision. Several algorithms have been developed and have achieved considerable success. However, most existing methods perform poorly in hazy scenes because of poor visibility. Although some strategies can address this problem, they leave room for improvement because of their limited performance in real-world scenarios and the lack of real-world clear ground truth. To resolve this problem, inspired by CycleGAN, we construct a training paradigm called RVSL, which integrates ReID and domain transformation techniques. The network is trained in a semi-supervised fashion and requires neither ID labels nor the corresponding clear ground truths to learn the hazy vehicle ReID task in real-world haze scenes. To further constrain the unsupervised learning process effectively, several losses are developed. Experimental results on synthetic and real-world datasets indicate that the proposed method can achieve state-of-the-art performance on hazy vehicle ReID problems. It is worth mentioning that although the proposed method is trained without real-world label information, it can achieve competitive performance compared to existing supervised methods trained on complete label information.

Volume: 62 1, Pages: 427-443

Citations: 3

ActiveNeRF: Learning where to See with Uncertainty Estimation

Pub Date : 2022-09-18 DOI: 10.48550/arXiv.2209.08546

Xuran Pan, Zihang Lai, Shiji Song, Gao Huang

Recently, Neural Radiance Fields (NeRF) has shown promising performance on reconstructing 3D scenes and synthesizing novel views from a sparse set of 2D images. Albeit effective, the performance of NeRF is highly influenced by the quality of training samples. With limited posed images from the scene, NeRF fails to generalize well to novel views and may collapse to trivial solutions in unobserved regions. This makes NeRF impractical under resource-constrained scenarios. In this paper, we present a novel learning framework, ActiveNeRF, aiming to model a 3D scene with a constrained input budget. Specifically, we first incorporate uncertainty estimation into a NeRF model, which ensures robustness under few observations and provides an interpretation of how NeRF understands the scene. On this basis, we propose to supplement the existing training set with newly captured samples based on an active learning scheme. By evaluating the reduction of uncertainty given new inputs, we select the samples that bring the most information gain. In this way, the quality of novel view synthesis can be improved with minimal additional resources. Extensive experiments validate the performance of our model on both realistic and synthetic scenes, especially with scarcer training data. Code will be released at https://github.com/LeapLabTHU/ActiveNeRF.
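
The selection rule "pick the sample that reduces uncertainty the most" can be sketched with a toy Gaussian model. The assumptions here are not taken from the paper: each scene point carries a scalar prior variance, one new observation of a point shrinks that variance by the standard Gaussian update, and each candidate view simply lists the points it would observe. All names are hypothetical.

```python
def posterior_variance(prior_var, noise_var):
    """Variance of a Gaussian estimate after one noisy Gaussian observation."""
    return prior_var * noise_var / (prior_var + noise_var)


def uncertainty_reduction(prior_vars, observed_indices, noise_var):
    """Total drop in variance over the points a candidate view would observe."""
    return sum(prior_vars[i] - posterior_variance(prior_vars[i], noise_var)
               for i in observed_indices)


def select_view(prior_vars, candidates, noise_var):
    """Return the name of the candidate view with the largest reduction."""
    return max(candidates,
               key=lambda name: uncertainty_reduction(
                   prior_vars, candidates[name], noise_var))


# Hypothetical scene: point 0 is poorly observed, points 1 and 2 are well known.
prior_vars = [4.0, 0.1, 0.1, 2.0]
candidates = {"front": [1, 2], "side": [0], "top": [3, 1]}
best = select_view(prior_vars, candidates, noise_var=1.0)
```

In this toy setup the "side" view wins because it observes the most uncertain point, which mirrors the intuition of capturing new images where the model knows least.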

Volume: 57 1, Pages: 230-246

Citations: 27

Learning to Weight Samples for Dynamic Early-exiting Networks

Pub Date : 2022-09-17 DOI: 10.48550/arXiv.2209.08310

Yizeng Han, Yifan Pu, Zihang Lai, Chaofei Wang, S. Song, Junfen Cao, Wenhui Huang, Chao Deng, Gao Huang

Early exiting is an effective paradigm for improving the inference efficiency of deep networks. By constructing classifiers with varying resource demands (the exits), such networks allow easy samples to be output at early exits, removing the need for executing deeper layers. While existing works mainly focus on the architectural design of multi-exit networks, the training strategies for such models are largely left unexplored. The current state-of-the-art models treat all samples the same during training. However, the early-exiting behavior during testing has been ignored, leading to a gap between training and testing. In this paper, we propose to bridge this gap by sample weighting. Intuitively, easy samples, which generally exit early in the network during inference, should contribute more to training early classifiers. The training of hard samples (which mostly exit from deeper layers), however, should be emphasized by the late classifiers. Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit. This weight prediction network and the backbone model are jointly optimized under a meta-learning framework with a novel optimization objective. By bringing the adaptive behavior during inference into the training phase, we show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency. Code is available at https://github.com/LeapLabTHU/L2W-DEN.
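
The inference-time behaviour described above can be sketched as follows. This shows only confidence-thresholded early exiting, not the paper's weight prediction network or its meta-learning objective. The toy exits and the threshold of 0.9 are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exits, x, threshold):
    """Run exits from shallow to deep; stop at the first confident one.

    Returns (predicted_class, exit_index). Deeper exits are never
    evaluated once an earlier exit is confident enough.
    """
    for depth, classifier in enumerate(exits):
        probs = softmax(classifier(x))
        confidence = max(probs)
        is_last = depth == len(exits) - 1
        if confidence >= threshold or is_last:
            return probs.index(confidence), depth


# Hypothetical two-class exits: deeper exits amplify the evidence in x more.
exits = [lambda x, s=s: [s * x, 0.0] for s in (1.0, 2.0, 4.0)]

easy = early_exit_predict(exits, 5.0, threshold=0.9)    # confident at once
medium = early_exit_predict(exits, 1.5, threshold=0.9)  # needs the second exit
hard = early_exit_predict(exits, 0.5, threshold=0.9)    # falls through to the last
```

Because each sample stops at a different depth, the samples that reach a given exit at test time differ from the full training set, which is the train/test gap the abstract proposes to close with per-sample loss weights.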

Volume: 49 1, Pages: 362-378

Citations: 19

PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation

Pub Date : 2022-09-16 DOI: 10.48550/arXiv.2209.08194

Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, Xiaohui Xie

Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Considering image patches as tokens, transformers can model the global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale up these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which locates a rough human mask and performs self-attention only within the selected tokens. Furthermore, we extend our PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that our PPT can match the accuracy of previous pose transformer methods while reducing the computation. Moreover, experiments on Human 3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.
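
The saving from pruning can be seen in a small sketch: keep only the highest-scoring tokens, then count how many token pairs self-attention has to process. How PPT actually scores tokens is not stated in the abstract, so the per-token scores, the keep ratio, and the function names below are assumptions.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)  # restore spatial order of the surviving tokens
    return kept, [tokens[i] for i in kept]


def attention_pairs(num_tokens):
    """Self-attention compares every token with every token."""
    return num_tokens * num_tokens


# Hypothetical image with six patch tokens and a foreground score per patch.
tokens = ["p0", "p1", "p2", "p3", "p4", "p5"]
scores = [0.10, 0.90, 0.30, 0.80, 0.05, 0.70]

kept_indices, kept_tokens = prune_tokens(tokens, scores, keep_ratio=0.5)
cost_before = attention_pairs(len(tokens))
cost_after = attention_pairs(len(kept_tokens))
```

Halving the token count cuts the pairwise attention work to a quarter, which is why restricting attention to the human region makes high-resolution features and additional views affordable.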

Volume: 16 1, Pages: 424-442

Citations: 16

A Large-scale Multiple-objective Method for Black-box Attack against Object Detection

Pub Date : 2022-09-16 DOI: 10.48550/arXiv.2209.07790

Siyuan Liang, Longkang Li, Yanbo Fan, Xiaojun Jia, Jingzhi Li, Baoyuan Wu, Xiaochun Cao

Recent studies have shown that detectors based on deep models are vulnerable to adversarial examples, even in the black-box scenario where the attacker cannot access the model information. Most existing attack methods aim to minimize the true positive rate, which often yields poor attack performance because another sub-optimal bounding box may be detected around the attacked bounding box and become the new true positive. To address this challenge, we propose to minimize the true positive rate and maximize the false positive rate simultaneously, which encourages more false positive objects to block the generation of new true positive bounding boxes. The attack is modeled as a multi-objective optimization (MOP) problem, for which a genetic algorithm can search for Pareto-optimal solutions. However, our task has more than two million decision variables, leading to low search efficiency. Thus, we extend the standard Genetic Algorithm with Random Subset selection and Divide-and-Conquer, called GARSDC, which significantly improves the efficiency. Moreover, to alleviate the sensitivity of genetic algorithms to population quality, we generate a gradient-prior initial population, utilizing the transferability between different detectors with similar backbones. Compared with state-of-the-art attack methods, GARSDC reduces the mAP by an average of 12.0 and the number of queries by about 1000 in extensive experiments. Our code can be found at https://github.com/LiangSiyuan21/GARSDC.
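The two-objective formulation in this abstract can be illustrated with a small Pareto-dominance check. This is only a sketch of the general idea, not the GARSDC implementation: the `Score` tuple, the `dominates` and `pareto_front` helpers, and the candidate values are all hypothetical, and a real attack would score perturbations by querying a detector.

```python
from typing import List, Tuple

# Hypothetical score per candidate perturbation:
# (true_positive_rate, false_positive_rate).
# The attacker wants a LOW true-positive rate and a HIGH false-positive rate.
Score = Tuple[float, float]


def dominates(a: Score, b: Score) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(scores: List[Score]) -> List[Score]:
    """Keep only the candidates that no other candidate dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Made-up scores, for illustration only.
candidates = [(0.9, 0.1), (0.5, 0.4), (0.5, 0.2), (0.3, 0.3), (0.7, 0.6)]
front = pareto_front(candidates)
print(front)
```

A genetic algorithm would typically keep such a non-dominated set across generations; the pairwise comparison here is quadratic in the population size, which is acceptable for a sketch but not for large populations.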

Citations: 8

A Deep Moving-camera Background Model

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-16 DOI: 10.48550/arXiv.2209.07923

Guy Erez, R. Weber, O. Freifeld

In video analysis, background models have many applications, such as background/foreground separation, change detection, anomaly detection, and tracking. However, while learning such a model from video captured by a static camera is a fairly well-solved task, success with a Moving-camera Background Model (MCBM) has been far more modest because of the algorithmic and scalability challenges that camera motion introduces. Thus, existing MCBMs are limited in their scope and in the camera-motion types they support. These hurdles have also impeded the use, in this unsupervised task, of end-to-end solutions based on deep learning (DL). Moreover, existing MCBMs usually model the background either on the domain of a typically large panoramic image or in an online fashion. Unfortunately, the former creates several problems, including poor scalability, while the latter prevents recognizing and leveraging cases where the camera revisits previously seen parts of the scene. This paper proposes a new method, called DeepMCBM, that eliminates all the aforementioned issues and achieves state-of-the-art results. Concretely, we first identify the difficulties associated with joint alignment of video frames in general and in a DL setting in particular. Next, we propose a new strategy for joint alignment that lets us use a spatial transformer net with neither regularization nor any form of specialized (and non-differentiable) initialization. Coupled with an autoencoder conditioned on unwarped robust central moments (obtained from the joint alignment), this yields an end-to-end regularization-free MCBM that supports a broad range of camera motions and scales gracefully. We demonstrate DeepMCBM's utility on a variety of videos, including ones beyond the scope of other methods. Our code is available at https://github.com/BGU-CS-VIL/DeepMCBM.
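To make the notion of a background model concrete, the following sketch shows a classical baseline for frames that are assumed to be already aligned: a per-pixel median as a robust background estimate, followed by thresholding for foreground separation. It is not DeepMCBM, which learns the alignment and uses an autoencoder; the function names, the tiny 2x2 frames, and the threshold are hypothetical.

```python
from statistics import median
from typing import List

Frame = List[List[float]]  # grayscale image as rows of pixel intensities


def median_background(frames: List[Frame]) -> Frame:
    """Per-pixel median over frames that are assumed to be already aligned."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(width)]
            for r in range(height)]


def foreground_mask(frame: Frame, background: Frame, threshold: float) -> List[List[bool]]:
    """Mark pixels whose absolute difference from the background exceeds the threshold."""
    return [[abs(p - b) > threshold for p, b in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]


# Made-up data: three aligned 2x2 frames.
frames = [
    [[10.0, 10.0], [10.0, 10.0]],
    [[11.0, 10.0], [10.0, 200.0]],  # a bright object covers the bottom-right pixel
    [[9.0, 10.0], [10.0, 10.0]],
]
background = median_background(frames)
mask = foreground_mask(frames[1], background, threshold=30.0)
print(background)
print(mask)
```

The median is used because a single outlier frame, such as the passing object above, does not shift it, whereas a per-pixel mean would be pulled toward the outlier.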

Citations: 1

Self-distilled Feature Aggregation for Self-supervised Monocular Depth Estimation

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-15 DOI: 10.48550/arXiv.2209.07088

Zhengming Zhou, Qiulei Dong

Self-supervised monocular depth estimation has received much attention recently in computer vision. Most existing works in the literature aggregate multi-scale features for depth prediction via either straightforward concatenation or element-wise addition; however, such feature aggregation operations generally neglect the contextual consistency between multi-scale features. To address this problem, we propose the Self-Distilled Feature Aggregation (SDFA) module, which simultaneously aggregates a pair of low-scale and high-scale features and maintains their contextual consistency. The SDFA employs three branches to learn three feature offset maps: one offset map for refining the input low-scale feature and the other two for refining the input high-scale feature in a designed self-distillation manner. Then, we propose an SDFA-based network for self-supervised monocular depth estimation and design a self-distilled training strategy to train the proposed network with the SDFA module. Experimental results on the KITTI dataset demonstrate that the proposed method outperforms the compared state-of-the-art methods in most cases. The code is available at https://github.com/ZM-Zhou/SDFA-Net_pytorch.
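The two baseline aggregation operations this abstract mentions, concatenation and element-wise addition, can be sketched on nested lists as below. This illustrates only those baselines, not the SDFA module or its offset maps; the helper names, the nearest-neighbour upsampling step, and the toy feature values are assumptions made for the example.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def upsample_nearest(feature: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling of every channel by an integer factor."""
    return [
        [[row[c // factor] for c in range(len(row) * factor)]
         for row in channel for _ in range(factor)]
        for channel in feature
    ]


def aggregate_concat(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Stack the channels of two feature maps with the same spatial size."""
    return a + b


def aggregate_add(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Sum two feature maps of identical shape, element by element."""
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


# Made-up features: a coarse 1x2 map and a fine 2x4 map, one channel each.
coarse = [[[1.0, 2.0]]]
fine = [[[0.5, 0.5, 0.5, 0.5],
         [0.5, 0.5, 0.5, 0.5]]]

up = upsample_nearest(coarse, 2)      # now 2x4, matching the fine map
added = aggregate_add(up, fine)       # same shape, values summed
stacked = aggregate_concat(up, fine)  # two channels
print(added)
print(len(stacked))
```

Both operations combine values at the same grid position without checking whether the upsampled and fine features describe the same content there, which is the kind of contextual inconsistency the abstract refers to.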

Citations: 14