
Latest publications in Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Real-time Online Video Detection with Temporal Smoothing Transformers
Yue Zhao, Philipp Krahenbuhl
Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernels and apply two kinds of temporal smoothing kernels: a box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame and only requires a constant-time update each frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6\times$ faster than equivalent sliding-window based transformers with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchen-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all but one prior approach on the THUMOS'14 dataset.
{"title":"Real-time Online Video Detection with Temporal Smoothing Transformers","authors":"Yue Zhao, Philipp Krahenbuhl","doi":"10.48550/arXiv.2209.09236","DOIUrl":"https://doi.org/10.48550/arXiv.2209.09236","url":null,"abstract":"Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernel and apply two kinds of temporal smoothing kernel: A box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame, and only requires a constant time update each frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6times$ faster than equivalent sliding-window based transformers with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchen-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all but one prior approaches on the THUMOS'14 dataset.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75930175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
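To make the constant-time claim in the abstract above concrete, the sketch below illustrates a Laplace-style (exponential-decay) streaming kernel attention in NumPy: decayed running sums stand in for explicit attention over the full history, so each new frame costs the same amount of work. The class name, the feature map phi, the dimensions, and the decay value are illustrative assumptions, not TeSTra's actual architecture.

```python
import numpy as np

class LaplaceStreamingAttention:
    """Illustrative constant-time streaming cross-attention with an
    exponential-decay (Laplace) temporal kernel; a sketch, not TeSTra."""

    def __init__(self, dim, decay=0.1):
        self.decay = np.exp(-decay)      # per-frame decay factor
        self.num = np.zeros((dim, dim))  # running sum of phi(k) v^T
        self.den = np.zeros(dim)         # running sum of phi(k)

    @staticmethod
    def phi(x):
        # simple positive feature map, a stand-in for the kernel feature map
        return np.maximum(x, 0.0) + 1e-6

    def update(self, k, v):
        # O(1) per-frame update: decay the old history, add the new frame
        pk = self.phi(k)
        self.num = self.decay * self.num + np.outer(pk, v)
        self.den = self.decay * self.den + pk

    def attend(self, q):
        pq = self.phi(q)
        return pq @ self.num / (pq @ self.den + 1e-6)

# toy usage: stream 2048 frames; every step costs the same amount of work
rng = np.random.default_rng(0)
attn = LaplaceStreamingAttention(dim=64, decay=0.05)
for _ in range(2048):
    k, v, q = rng.normal(size=(3, 64))
    attn.update(k, v)
    out = attn.attend(q)
print(out.shape)  # (64,)
```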
RVSL: Robust Vehicle Similarity Learning in Real Hazy Scenes Based on Semi-supervised Learning
Wei-Ting Chen, I-Hsiang Chen, C. Yeh, Han Yang, Hua-En Chang, Jianwei Ding, Sy-Yen Kuo
Recently, vehicle similarity learning, also called re-identification (ReID), has attracted significant attention in computer vision. Several algorithms have been developed and obtained considerable success. However, most existing methods perform poorly in hazy scenarios due to poor visibility. Though some strategies can partially address this problem, they still have room for improvement owing to their limited performance in real-world scenarios and the lack of clear real-world ground truth. Thus, to resolve this problem, inspired by CycleGAN, we construct a training paradigm called RVSL which integrates ReID and domain transformation techniques. The network is trained in a semi-supervised fashion and does not require ID labels or the corresponding clear ground truths to learn the hazy vehicle ReID task in real-world haze scenes. To further constrain the unsupervised learning process effectively, several losses are developed. Experimental results on synthetic and real-world datasets indicate that the proposed method can achieve state-of-the-art performance on hazy vehicle ReID problems. It is worth mentioning that although the proposed method is trained without real-world label information, it can achieve competitive performance compared to existing supervised methods trained on complete label information.
{"title":"RVSL: Robust Vehicle Similarity Learning in Real Hazy Scenes Based on Semi-supervised Learning","authors":"Wei-Ting Chen, I-Hsiang Chen, C. Yeh, Han Yang, Hua-En Chang, Jianwei Ding, Sy-Yen Kuo","doi":"10.48550/arXiv.2209.08630","DOIUrl":"https://doi.org/10.48550/arXiv.2209.08630","url":null,"abstract":"Recently, vehicle similarity learning, also called re-identification (ReID), has attracted significant attention in computer vision. Several algorithms have been developed and obtained considerable success. However, most existing methods have unpleasant performance in the hazy scenario due to poor visibility. Though some strategies are possible to resolve this problem, they still have room to be improved due to the limited performance in real-world scenarios and the lack of real-world clear ground truth. Thus, to resolve this problem, inspired by CycleGAN, we construct a training paradigm called textbf{RVSL} which integrates ReID and domain transformation techniques. The network is trained on semi-supervised fashion and does not require to employ the ID labels and the corresponding clear ground truths to learn hazy vehicle ReID mission in the real-world haze scenes. To further constrain the unsupervised learning process effectively, several losses are developed. Experimental results on synthetic and real-world datasets indicate that the proposed method can achieve state-of-the-art performance on hazy vehicle ReID problems. It is worth mentioning that although the proposed method is trained without real-world label information, it can achieve competitive performance compared to existing supervised methods trained on complete label information.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81307260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
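As a rough illustration of how a semi-supervised objective of this kind can be assembled, the sketch below combines a supervised ReID term on labeled synthetic data with reconstruction and cycle-style consistency terms on unlabeled real hazy data. The function name, the specific loss terms, and the weights are assumptions made for illustration; they are not RVSL's actual losses.

```python
import torch
import torch.nn.functional as F

def rvsl_style_loss(reid_logits, labels, rehazed, hazy_input,
                    clear_syn, dehazed_syn, w_rec=1.0, w_cyc=1.0):
    """Hedged sketch of a combined semi-supervised objective for
    hazy-scene ReID; terms and weights are illustrative assumptions."""
    # supervised ReID classification on synthetic (labeled) images
    loss_id = F.cross_entropy(reid_logits, labels)
    # synthetic pairs have clear ground truth: direct reconstruction loss
    loss_rec = F.l1_loss(dehazed_syn, clear_syn)
    # real hazy images have no clear ground truth: cycle consistency,
    # i.e. dehazing then rehazing should reproduce the hazy input
    loss_cyc = F.l1_loss(rehazed, hazy_input)
    return loss_id + w_rec * loss_rec + w_cyc * loss_cyc

# toy tensors standing in for network outputs
B, C, H, W, n_ids = 4, 3, 64, 64, 10
loss = rvsl_style_loss(
    reid_logits=torch.randn(B, n_ids),
    labels=torch.randint(0, n_ids, (B,)),
    rehazed=torch.rand(B, C, H, W),
    hazy_input=torch.rand(B, C, H, W),
    clear_syn=torch.rand(B, C, H, W),
    dehazed_syn=torch.rand(B, C, H, W),
)
print(loss.item())
```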
ActiveNeRF: Learning where to See with Uncertainty Estimation
Xuran Pan, Zihang Lai, Shiji Song, Gao Huang
Recently, Neural Radiance Fields (NeRF) has shown promising performance on reconstructing 3D scenes and synthesizing novel views from a sparse set of 2D images. Albeit effective, the performance of NeRF is highly influenced by the quality of training samples. With limited posed images from the scene, NeRF fails to generalize well to novel views and may collapse to trivial solutions in unobserved regions. This makes NeRF impractical under resource-constrained scenarios. In this paper, we present a novel learning framework, ActiveNeRF, aiming to model a 3D scene with a constrained input budget. Specifically, we first incorporate uncertainty estimation into a NeRF model, which ensures robustness under few observations and provides an interpretation of how NeRF understands the scene. On this basis, we propose to supplement the existing training set with newly captured samples based on an active learning scheme. By evaluating the reduction of uncertainty given new inputs, we select the samples that bring the most information gain. In this way, the quality of novel view synthesis can be improved with minimal additional resources. Extensive experiments validate the performance of our model on both realistic and synthetic scenes, especially with scarcer training data. Code will be released at https://github.com/LeapLabTHU/ActiveNeRF.
{"title":"ActiveNeRF: Learning where to See with Uncertainty Estimation","authors":"Xuran Pan, Zihang Lai, Shiji Song, Gao Huang","doi":"10.48550/arXiv.2209.08546","DOIUrl":"https://doi.org/10.48550/arXiv.2209.08546","url":null,"abstract":"Recently, Neural Radiance Fields (NeRF) has shown promising performances on reconstructing 3D scenes and synthesizing novel views from a sparse set of 2D images. Albeit effective, the performance of NeRF is highly influenced by the quality of training samples. With limited posed images from the scene, NeRF fails to generalize well to novel views and may collapse to trivial solutions in unobserved regions. This makes NeRF impractical under resource-constrained scenarios. In this paper, we present a novel learning framework, ActiveNeRF, aiming to model a 3D scene with a constrained input budget. Specifically, we first incorporate uncertainty estimation into a NeRF model, which ensures robustness under few observations and provides an interpretation of how NeRF understands the scene. On this basis, we propose to supplement the existing training set with newly captured samples based on an active learning scheme. By evaluating the reduction of uncertainty given new inputs, we select the samples that bring the most information gain. In this way, the quality of novel view synthesis can be improved with minimal additional resources. Extensive experiments validate the performance of our model on both realistic and synthetic scenes, especially with scarcer training data. Code will be released at url{https://github.com/LeapLabTHU/ActiveNeRF}.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89756708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 27
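The active-learning step described above can be pictured as an acquisition rule over candidate views. The sketch below assumes an uncertainty-aware NeRF has already produced per-ray variance estimates for each candidate view and simply picks the views with the highest mean predicted uncertainty; using that mean as the acquisition score is a simplification of the paper's information-gain criterion, and the function name is hypothetical.

```python
import numpy as np

def select_views_by_uncertainty(candidate_uncertainty_maps, k=2):
    """Pick the k candidate views whose rays have the highest mean
    predicted uncertainty; a hedged stand-in for the acquisition step."""
    scores = candidate_uncertainty_maps.mean(axis=1)   # one score per view
    return np.argsort(scores)[::-1][:k]                # highest first

# toy example: 8 candidate views, 1024 sampled rays each
rng = np.random.default_rng(0)
unc = rng.random((8, 1024))
print(select_views_by_uncertainty(unc, k=2))
```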
Learning to Weight Samples for Dynamic Early-exiting Networks
Yizeng Han, Yifan Pu, Zihang Lai, Chaofei Wang, S. Song, Junfen Cao, Wenhui Huang, Chao Deng, Gao Huang
Early exiting is an effective paradigm for improving the inference efficiency of deep networks. By constructing classifiers with varying resource demands (the exits), such networks allow easy samples to be output at early exits, removing the need for executing deeper layers. While existing works mainly focus on the architectural design of multi-exit networks, the training strategies for such models are largely left unexplored. The current state-of-the-art models treat all samples the same during training. However, the early-exiting behavior during testing has been ignored, leading to a gap between training and testing. In this paper, we propose to bridge this gap by sample weighting. Intuitively, easy samples, which generally exit early in the network during inference, should contribute more to training early classifiers. The training of hard samples (mostly exit from deeper layers), however, should be emphasized by the late classifiers. Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit. This weight prediction network and the backbone model are jointly optimized under a meta-learning framework with a novel optimization objective. By bringing the adaptive behavior during inference into the training phase, we show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency. Code is available at https://github.com/LeapLabTHU/L2W-DEN.
{"title":"Learning to Weight Samples for Dynamic Early-exiting Networks","authors":"Yizeng Han, Yifan Pu, Zihang Lai, Chaofei Wang, S. Song, Junfen Cao, Wenhui Huang, Chao Deng, Gao Huang","doi":"10.48550/arXiv.2209.08310","DOIUrl":"https://doi.org/10.48550/arXiv.2209.08310","url":null,"abstract":"Early exiting is an effective paradigm for improving the inference efficiency of deep networks. By constructing classifiers with varying resource demands (the exits), such networks allow easy samples to be output at early exits, removing the need for executing deeper layers. While existing works mainly focus on the architectural design of multi-exit networks, the training strategies for such models are largely left unexplored. The current state-of-the-art models treat all samples the same during training. However, the early-exiting behavior during testing has been ignored, leading to a gap between training and testing. In this paper, we propose to bridge this gap by sample weighting. Intuitively, easy samples, which generally exit early in the network during inference, should contribute more to training early classifiers. The training of hard samples (mostly exit from deeper layers), however, should be emphasized by the late classifiers. Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit. This weight prediction network and the backbone model are jointly optimized under a meta-learning framework with a novel optimization objective. By bringing the adaptive behavior during inference into the training phase, we show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency. Code is available at https://github.com/LeapLabTHU/L2W-DEN.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80061552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
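A minimal sketch of the sample-weighting idea: a small weight prediction network maps per-sample statistics to per-sample, per-exit loss weights, and the multi-exit loss is averaged under those weights. The network architecture, its input (here, the per-exit losses themselves), and the omission of the meta-learning outer loop are all simplifying assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightNet(nn.Module):
    """Tiny stand-in for a weight prediction network: per-exit losses in,
    positive per-sample, per-exit weights out. Illustrative only."""
    def __init__(self, n_exits, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_exits, hidden), nn.ReLU(),
            nn.Linear(hidden, n_exits), nn.Softplus())

    def forward(self, per_exit_losses):
        return self.mlp(per_exit_losses)

def weighted_multi_exit_loss(exit_logits, labels, weight_net):
    # exit_logits: list of (B, n_classes) tensors, one per exit
    losses = torch.stack(
        [F.cross_entropy(lg, labels, reduction="none") for lg in exit_logits],
        dim=1)                                 # (B, n_exits)
    weights = weight_net(losses.detach())      # (B, n_exits)
    return (weights * losses).mean()

# toy usage
B, n_classes, n_exits = 8, 10, 3
logits = [torch.randn(B, n_classes) for _ in range(n_exits)]
labels = torch.randint(0, n_classes, (B,))
loss = weighted_multi_exit_loss(logits, labels, WeightNet(n_exits))
print(loss.item())
```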
PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation
Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, Xiaohui Xie
Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Considering image patches as tokens, transformers can model the global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale up these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which can locate a rough human mask and perform self-attention only within selected tokens. Furthermore, we extend our PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that our PPT can match the accuracy of previous pose transformer methods while reducing the computation. Moreover, experiments on Human 3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.
{"title":"PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation","authors":"Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, Xiaohui Xie","doi":"10.48550/arXiv.2209.08194","DOIUrl":"https://doi.org/10.48550/arXiv.2209.08194","url":null,"abstract":"Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Considering image patches as tokens, transformers can model the global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale up these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which can locate a rough human mask and performs self-attention only within selected tokens. Furthermore, we extend our PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that our PPT can match the accuracy of previous pose transformer methods while reducing the computation. Moreover, experiments on Human 3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86145355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
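The token pruning step can be illustrated as a top-k selection over patch tokens. In the sketch below the per-token scores are a placeholder for whatever signal identifies tokens on the human body; PPT derives such a rough human mask from attention, which is not reproduced here, and the function name and keep ratio are assumptions.

```python
import torch

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the highest-scoring tokens before running further
    self-attention layers; a hedged sketch of token pruning."""
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices            # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, C)      # (B, k, C)
    return torch.gather(tokens, 1, idx)            # (B, k, C)

# toy usage: an 8x8 grid of patch tokens, keep half of them
tokens = torch.randn(2, 64, 192)
scores = torch.rand(2, 64)     # stand-in for human-mask scores
print(prune_tokens(tokens, scores).shape)   # torch.Size([2, 32, 192])
```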
A Large-scale Multiple-objective Method for Black-box Attack against Object Detection
Siyuan Liang, Longkang Li, Yanbo Fan, Xiaojun Jia, Jingzhi Li, Baoyuan Wu, Xiaochun Cao
Recent studies have shown that detectors based on deep models are vulnerable to adversarial examples, even in the black-box scenario where the attacker cannot access the model information. Most existing attack methods aim to minimize the true positive rate, which often shows poor attack performance, as another sub-optimal bounding box may be detected around the attacked bounding box to be the new true positive one. To settle this challenge, we propose to minimize the true positive rate and maximize the false positive rate, which can encourage more false positive objects to block the generation of new true positive bounding boxes. It is modeled as a multi-objective optimization (MOP) problem, for which a genetic algorithm can search for the Pareto-optimal solutions. However, our task has more than two million decision variables, leading to low searching efficiency. Thus, we extend the standard Genetic Algorithm with Random Subset selection and Divide-and-Conquer, called GARSDC, which significantly improves the efficiency. Moreover, to alleviate the sensitivity to population quality in genetic algorithms, we generate a gradient-prior initial population, utilizing the transferability between different detectors with similar backbones. Compared with the state-of-the-art attack methods, GARSDC decreases the mAP by an average of 12.0 and the number of queries by about 1000 times in extensive experiments. Our codes can be found at https://github.com/LiangSiyuan21/GARSDC.
{"title":"A Large-scale Multiple-objective Method for Black-box Attack against Object Detection","authors":"Siyuan Liang, Longkang Li, Yanbo Fan, Xiaojun Jia, Jingzhi Li, Baoyuan Wu, Xiaochun Cao","doi":"10.48550/arXiv.2209.07790","DOIUrl":"https://doi.org/10.48550/arXiv.2209.07790","url":null,"abstract":"Recent studies have shown that detectors based on deep models are vulnerable to adversarial examples, even in the black-box scenario where the attacker cannot access the model information. Most existing attack methods aim to minimize the true positive rate, which often shows poor attack performance, as another sub-optimal bounding box may be detected around the attacked bounding box to be the new true positive one. To settle this challenge, we propose to minimize the true positive rate and maximize the false positive rate, which can encourage more false positive objects to block the generation of new true positive bounding boxes. It is modeled as a multi-objective optimization (MOP) problem, of which the generic algorithm can search the Pareto-optimal. However, our task has more than two million decision variables, leading to low searching efficiency. Thus, we extend the standard Genetic Algorithm with Random Subset selection and Divide-and-Conquer, called GARSDC, which significantly improves the efficiency. Moreover, to alleviate the sensitivity to population quality in generic algorithms, we generate a gradient-prior initial population, utilizing the transferability between different detectors with similar backbones. Compared with the state-of-art attack methods, GARSDC decreases by an average 12.0 in the mAP and queries by about 1000 times in extensive experiments. Our codes can be found at https://github.com/LiangSiyuan21/ GARSDC.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78716046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
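A minimal sketch of a genetic algorithm that, like the random-subset idea above, mutates only a small random subset of a very high-dimensional decision vector per offspring. Crossover, the divide-and-conquer decomposition, and the gradient-prior initialization are omitted, and the fitness function is a toy stand-in for the black-box detector objective; all names and hyperparameters here are assumptions.

```python
import numpy as np

def ga_random_subset(fitness, dim, pop_size=20, subset=256,
                     generations=50, seed=0):
    """Genetic algorithm with random-subset mutation; a hedged sketch of
    the idea, not the GARSDC implementation."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1, 1, size=(pop_size, dim))   # candidate perturbations
    fit = np.array([fitness(p) for p in pop])
    for _ in range(generations):
        # binary-tournament parent selection
        a, b = rng.integers(pop_size, size=2)
        parent = pop[a] if fit[a] > fit[b] else pop[b]
        child = parent.copy()
        # mutate only a small random subset of the decision variables
        idx = rng.choice(dim, size=subset, replace=False)
        child[idx] += rng.normal(scale=0.1, size=subset)
        child = np.clip(child, -1, 1)
        f_child = fitness(child)
        worst = fit.argmin()
        if f_child > fit[worst]:                      # replace worst member
            pop[worst], fit[worst] = child, f_child
    return pop[fit.argmax()], fit.max()

# toy fitness standing in for the (black-box) attack objective
best, score = ga_random_subset(lambda x: -np.abs(x).sum(), dim=4096)
print(score)
```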
A Deep Moving-camera Background Model
Guy Erez, R. Weber, O. Freifeld
In video analysis, background models have many applications such as background/foreground separation, change detection, anomaly detection, tracking, and more. However, while learning such a model in a video captured by a static camera is a fairly-solved task, in the case of a Moving-camera Background Model (MCBM), the success has been far more modest due to algorithmic and scalability challenges that arise from the camera motion. Thus, existing MCBMs are limited in their scope and their supported camera-motion types. These hurdles also impeded the employment, in this unsupervised task, of end-to-end solutions based on deep learning (DL). Moreover, existing MCBMs usually model the background either on the domain of a typically-large panoramic image or in an online fashion. Unfortunately, the former creates several problems, including poor scalability, while the latter prevents the recognition and leveraging of cases where the camera revisits previously-seen parts of the scene. This paper proposes a new method, called DeepMCBM, that eliminates all the aforementioned issues and achieves state-of-the-art results. Concretely, first we identify the difficulties associated with joint alignment of video frames in general and in a DL setting in particular. Next, we propose a new strategy for joint alignment that lets us use a spatial transformer net with neither a regularization nor any form of specialized (and non-differentiable) initialization. Coupled with an autoencoder conditioned on unwarped robust central moments (obtained from the joint alignment), this yields an end-to-end regularization-free MCBM that supports a broad range of camera motions and scales gracefully. We demonstrate DeepMCBM's utility on a variety of videos, including ones beyond the scope of other methods. Our code is available at https://github.com/BGU-CS-VIL/DeepMCBM.
{"title":"A Deep Moving-camera Background Model","authors":"Guy Erez, R. Weber, O. Freifeld","doi":"10.48550/arXiv.2209.07923","DOIUrl":"https://doi.org/10.48550/arXiv.2209.07923","url":null,"abstract":"In video analysis, background models have many applications such as background/foreground separation, change detection, anomaly detection, tracking, and more. However, while learning such a model in a video captured by a static camera is a fairly-solved task, in the case of a Moving-camera Background Model (MCBM), the success has been far more modest due to algorithmic and scalability challenges that arise due to the camera motion. Thus, existing MCBMs are limited in their scope and their supported camera-motion types. These hurdles also impeded the employment, in this unsupervised task, of end-to-end solutions based on deep learning (DL). Moreover, existing MCBMs usually model the background either on the domain of a typically-large panoramic image or in an online fashion. Unfortunately, the former creates several problems, including poor scalability, while the latter prevents the recognition and leveraging of cases where the camera revisits previously-seen parts of the scene. This paper proposes a new method, called DeepMCBM, that eliminates all the aforementioned issues and achieves state-of-the-art results. Concretely, first we identify the difficulties associated with joint alignment of video frames in general and in a DL setting in particular. Next, we propose a new strategy for joint alignment that lets us use a spatial transformer net with neither a regularization nor any form of specialized (and non-differentiable) initialization. Coupled with an autoencoder conditioned on unwarped robust central moments (obtained from the joint alignment), this yields an end-to-end regularization-free MCBM that supports a broad range of camera motions and scales gracefully. We demonstrate DeepMCBM's utility on a variety of videos, including ones beyond the scope of other methods. Our code is available at https://github.com/BGU-CS-VIL/DeepMCBM .","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72603246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
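To give a flavor of the kind of robust central-moment statistics that can condition such a background model, the sketch below computes a robust center (the per-pixel median) and median-centered higher moments over frames that are assumed to be already jointly aligned into a shared coordinate system. The specific robust statistics chosen here are an illustrative assumption, not the paper's exact definition.

```python
import numpy as np

def robust_central_moments(aligned_frames, orders=(2, 3)):
    """Per-pixel robust center and median-centered higher moments over
    already-aligned frames; a hedged sketch of the conditioning statistics."""
    center = np.median(aligned_frames, axis=0)               # (H, W, C)
    diffs = aligned_frames - center
    moments = [np.median(diffs ** p, axis=0) for p in orders]
    return center, np.stack(moments, axis=0)

# toy usage: 30 aligned RGB frames of size 64x64
frames = np.random.rand(30, 64, 64, 3)
bg_center, higher = robust_central_moments(frames)
print(bg_center.shape, higher.shape)   # (64, 64, 3) (2, 64, 64, 3)
```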
Self-distilled Feature Aggregation for Self-supervised Monocular Depth Estimation
Zhengming Zhou, Qiulei Dong
Self-supervised monocular depth estimation has received much attention recently in computer vision. Most of the existing works in the literature aggregate multi-scale features for depth prediction via either straightforward concatenation or element-wise addition; however, such feature aggregation operations generally neglect the contextual consistency between multi-scale features. Addressing this problem, we propose the Self-Distilled Feature Aggregation (SDFA) module for simultaneously aggregating a pair of low-scale and high-scale features and maintaining their contextual consistency. The SDFA employs three branches to learn three feature offset maps respectively: one offset map for refining the input low-scale feature and the other two for refining the input high-scale feature under a designed self-distillation manner. Then, we propose an SDFA-based network for self-supervised monocular depth estimation, and design a self-distilled training strategy to train the proposed network with the SDFA module. Experimental results on the KITTI dataset demonstrate that the proposed method outperforms the comparative state-of-the-art methods in most cases. The code is available at https://github.com/ZM-Zhou/SDFA-Net_pytorch.
{"title":"Self-distilled Feature Aggregation for Self-supervised Monocular Depth Estimation","authors":"Zhengming Zhou, Qiulei Dong","doi":"10.48550/arXiv.2209.07088","DOIUrl":"https://doi.org/10.48550/arXiv.2209.07088","url":null,"abstract":"Self-supervised monocular depth estimation has received much attention recently in computer vision. Most of the existing works in literature aggregate multi-scale features for depth prediction via either straightforward concatenation or element-wise addition, however, such feature aggregation operations generally neglect the contextual consistency between multi-scale features. Addressing this problem, we propose the Self-Distilled Feature Aggregation (SDFA) module for simultaneously aggregating a pair of low-scale and high-scale features and maintaining their contextual consistency. The SDFA employs three branches to learn three feature offset maps respectively: one offset map for refining the input low-scale feature and the other two for refining the input high-scale feature under a designed self-distillation manner. Then, we propose an SDFA-based network for self-supervised monocular depth estimation, and design a self-distilled training strategy to train the proposed network with the SDFA module. Experimental results on the KITTI dataset demonstrate that the proposed method outperforms the comparative state-of-the-art methods in most cases. The code is available at https://github.com/ZM-Zhou/SDFA-Net_pytorch.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84354826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
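The offset maps can be pictured as per-pixel resampling of a feature map: each location reads the feature at a learned 2D offset via bilinear sampling. The sketch below shows only that resampling mechanism under assumed pixel-unit offsets; the three-branch design and the self-distillation supervision of SDFA are not reproduced, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def refine_with_offsets(feat, offsets):
    """Resample a feature map at learned 2D offsets (in pixels);
    a hedged sketch of offset-based feature refinement."""
    B, C, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=feat.dtype),
                            torch.arange(W, dtype=feat.dtype), indexing="ij")
    base = torch.stack((xs, ys), dim=0).to(feat.device)   # (2, H, W), x first
    grid = base.unsqueeze(0) + offsets                    # (B, 2, H, W)
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0                 # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    grid_n = torch.stack((gx, gy), dim=-1)                # (B, H, W, 2)
    return F.grid_sample(feat, grid_n, mode="bilinear", align_corners=True)

# toy usage: refine a low-scale feature map with a predicted offset map
feat = torch.randn(2, 32, 24, 80)
offsets = torch.randn(2, 2, 24, 80)    # stand-in for a predicted offset map
print(refine_with_offsets(feat, offsets).shape)   # torch.Size([2, 32, 24, 80])
```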
DevNet: Self-supervised Monocular Depth Learning via Density Volume Construction
Kaichen Zhou, Lanqing Hong, Changhao Chen, Hang Xu, Chao Ye, Qingyong Hu, Zhenguo Li
Self-supervised depth learning from monocular images normally relies on the 2D pixel-wise photometric relation between temporally adjacent image frames. However, such methods neither fully exploit the 3D point-wise geometric correspondences, nor effectively tackle the ambiguities in the photometric warping caused by occlusions or illumination inconsistency. To address these problems, this work proposes Density Volume Construction Network (DevNet), a novel self-supervised monocular depth learning framework, that can consider 3D spatial information, and exploit stronger geometric constraints among adjacent camera frustums. Instead of directly regressing the pixel value from a single image, our DevNet divides the camera frustum into multiple parallel planes and predicts the pointwise occlusion probability density on each plane. The final depth map is generated by integrating the density along corresponding rays. During the training process, novel regularization strategies and loss functions are introduced to mitigate photometric ambiguities and overfitting. Without obviously enlarging the model parameter size or running time, DevNet outperforms several representative baselines on both the KITTI-2015 outdoor dataset and NYU-V2 indoor dataset. In particular, the root-mean-square deviation is reduced by around 4% with DevNet on both KITTI-2015 and NYU-V2 in the task of depth estimation. Code is available at https://github.com/gitkaichenzhou/DevNet.
{"title":"DevNet: Self-supervised Monocular Depth Learning via Density Volume Construction","authors":"Kaichen Zhou, Lanqing Hong, Changhao Chen, Hang Xu, Chao Ye, Qingyong Hu, Zhenguo Li","doi":"10.48550/arXiv.2209.06351","DOIUrl":"https://doi.org/10.48550/arXiv.2209.06351","url":null,"abstract":"Self-supervised depth learning from monocular images normally relies on the 2D pixel-wise photometric relation between temporally adjacent image frames. However, they neither fully exploit the 3D point-wise geometric correspondences, nor effectively tackle the ambiguities in the photometric warping caused by occlusions or illumination inconsistency. To address these problems, this work proposes Density Volume Construction Network (DevNet), a novel self-supervised monocular depth learning framework, that can consider 3D spatial information, and exploit stronger geometric constraints among adjacent camera frustums. Instead of directly regressing the pixel value from a single image, our DevNet divides the camera frustum into multiple parallel planes and predicts the pointwise occlusion probability density on each plane. The final depth map is generated by integrating the density along corresponding rays. During the training process, novel regularization strategies and loss functions are introduced to mitigate photometric ambiguities and overfitting. Without obviously enlarging model parameters size or running time, DevNet outperforms several representative baselines on both the KITTI-2015 outdoor dataset and NYU-V2 indoor dataset. In particular, the root-mean-square-deviation is reduced by around 4% with DevNet on both KITTI-2015 and NYU-V2 in the task of depth estimation. Code is available at https://github.com/gitkaichenzhou/DevNet.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85845112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
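Turning per-plane densities into a depth map resembles volume rendering: densities become alpha values, which are composited front-to-back and used as weights for an expected depth along each ray. The sketch below is a generic version of that computation; the density-to-alpha conversion and the handling of plane spacing are simplifications rather than DevNet's exact formulation.

```python
import torch

def depth_from_plane_densities(densities, plane_depths):
    """Composite per-plane occlusion densities into an expected depth map,
    volume-rendering style; a hedged sketch, not DevNet's exact formulas."""
    # densities: (B, P, H, W) for P fronto-parallel planes; plane_depths: (P,)
    alphas = 1.0 - torch.exp(-torch.relu(densities))           # (B, P, H, W)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = alphas * trans                                    # (B, P, H, W)
    depth = (weights * plane_depths.view(1, -1, 1, 1)).sum(dim=1)
    return depth / (weights.sum(dim=1) + 1e-10)

# toy usage: 32 planes between 1 m and 80 m
densities = torch.rand(2, 32, 48, 160)
plane_depths = torch.linspace(1.0, 80.0, 32)
print(depth_from_plane_densities(densities, plane_depths).shape)  # (2, 48, 160)
```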
Robust Category-Level 6D Pose Estimation with Coarse-to-Fine Rendering of Neural Features
Wufei Ma, Angtian Wang, A. Yuille, Adam Kortylewski
We consider the problem of category-level 6D pose estimation from a single RGB image. Our approach represents an object category as a cuboid mesh and learns a generative model of the neural feature activations at each mesh vertex to perform pose estimation through differentiable rendering. A common problem of rendering-based approaches is that they rely on bounding box proposals, which do not convey information about the 3D rotation of the object and are not reliable when objects are partially occluded. Instead, we introduce a coarse-to-fine optimization strategy that utilizes the rendering process to estimate a sparse set of 6D object proposals, which are subsequently refined with gradient-based optimization. The key to enabling the convergence of our approach is a neural feature representation that is trained to be scale- and rotation-invariant using contrastive learning. Our experiments demonstrate an enhanced category-level 6D pose estimation performance compared to prior work, particularly under strong partial occlusion.
{"title":"Robust Category-Level 6D Pose Estimation with Coarse-to-Fine Rendering of Neural Features","authors":"Wufei Ma, Angtian Wang, A. Yuille, Adam Kortylewski","doi":"10.48550/arXiv.2209.05624","DOIUrl":"https://doi.org/10.48550/arXiv.2209.05624","url":null,"abstract":"We consider the problem of category-level 6D pose estimation from a single RGB image. Our approach represents an object category as a cuboid mesh and learns a generative model of the neural feature activations at each mesh vertex to perform pose estimation through differentiable rendering. A common problem of rendering-based approaches is that they rely on bounding box proposals, which do not convey information about the 3D rotation of the object and are not reliable when objects are partially occluded. Instead, we introduce a coarse-to-fine optimization strategy that utilizes the rendering process to estimate a sparse set of 6D object proposals, which are subsequently refined with gradient-based optimization. The key to enabling the convergence of our approach is a neural feature representation that is trained to be scale- and rotation-invariant using contrastive learning. Our experiments demonstrate an enhanced category-level 6D pose estimation performance compared to prior work, particularly under strong partial occlusion.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77724814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
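The coarse-to-fine strategy can be sketched as: score a sparse set of 6D pose proposals, keep the top few, and refine each by gradient ascent on a differentiable score. In the paper the score comes from rendering neural features and comparing them to the observed feature map; in the sketch below, score_fn is a hypothetical placeholder and the toy example uses a synthetic score, so this illustrates the search structure only.

```python
import torch

def coarse_to_fine_pose(score_fn, proposals, top_k=3, steps=50, lr=0.01):
    """Score sparse pose proposals, keep the best few, refine each with
    gradient ascent; a hedged sketch of coarse-to-fine pose search."""
    with torch.no_grad():
        coarse = torch.stack([score_fn(p) for p in proposals])
    best = proposals[coarse.topk(top_k).indices]          # (top_k, 6)
    refined, scores = [], []
    for init in best:
        pose = init.clone().requires_grad_(True)
        opt = torch.optim.Adam([pose], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            (-score_fn(pose)).backward()                  # gradient ascent
            opt.step()
        refined.append(pose.detach())
        scores.append(score_fn(pose).item())
    best_idx = int(torch.tensor(scores).argmax())
    return refined[best_idx]

# toy usage with a dummy score peaked at a hidden "true" pose
true_pose = torch.tensor([0.1, -0.2, 0.3, 0.0, 0.5, 1.0])
score_fn = lambda p: -((p - true_pose) ** 2).sum()
proposals = torch.randn(64, 6)
print(coarse_to_fine_pose(score_fn, proposals))
```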