
Latest publications in Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Photo-realistic Neural Domain Randomization
Sergey Zakharov, Rares Ambrus, V. Guizilini, Wadim Kehl, Adrien Gaidon
{"title":"Photo-realistic Neural Domain Randomization","authors":"Sergey Zakharov, Rares Ambrus, V. Guizilini, Wadim Kehl, Adrien Gaidon","doi":"10.1007/978-3-031-19806-9_18","DOIUrl":"https://doi.org/10.1007/978-3-031-19806-9_18","url":null,"abstract":"","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"90 1","pages":"310-327"},"PeriodicalIF":0.0,"publicationDate":"2022-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81463215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
PoseScript: 3D Human Poses from Natural Language
Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, F. Moreno-Noguer, Grégory Rogez
Natural language is leveraged in many computer vision tasks such as image captioning, cross-modal retrieval or visual question answering, to provide fine-grained semantic information. While human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. In this work, we introduce the PoseScript dataset, which pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. To increase the size of this dataset to a scale compatible with typical data-hungry learning algorithms, we propose an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information -- the posecodes -- using a set of simple but generic rules on the 3D keypoints. The posecodes are then combined into higher-level textual descriptions using syntactic rules. Automatic annotations substantially increase the amount of available data, and make it possible to effectively pretrain deep models for finetuning on human captions. To demonstrate the potential of annotated poses, we show applications of the PoseScript dataset to retrieval of relevant poses from large-scale datasets and to synthetic pose generation, both based on a textual pose description.
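To make the captioning pipeline concrete, here is a minimal sketch of a single posecode-style rule on 3D keypoints; the joint names, angle thresholds, and wording are illustrative assumptions rather than the dataset's actual rule set.

```python
import numpy as np

# Toy illustration of a "posecode"-style rule: categorize the left-knee angle
# from 3D keypoints and turn it into a text fragment. Joint names, thresholds,
# and the wording are illustrative assumptions, not the dataset's actual rules.

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle (degrees) at joint b formed by segments b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def knee_posecode(keypoints: dict) -> str:
    """Map the left-knee angle to a coarse category (the 'posecode')."""
    angle = joint_angle(keypoints["left_hip"], keypoints["left_knee"], keypoints["left_ankle"])
    if angle > 150:
        return "straight"
    if angle > 90:
        return "slightly bent"
    return "bent"

def render_sentence(code: str) -> str:
    """Combine the low-level posecode into a higher-level textual description."""
    return f"The left leg is {code}."

if __name__ == "__main__":
    pose = {
        "left_hip": np.array([0.0, 1.0, 0.0]),
        "left_knee": np.array([0.0, 0.5, 0.0]),
        "left_ankle": np.array([0.0, 0.6, 0.5]),
    }
    print(render_sentence(knee_posecode(pose)))  # "The left leg is bent."
```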
{"title":"PoseScript: 3D Human Poses from Natural Language","authors":"Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, F. Moreno-Noguer, Grégory Rogez","doi":"10.48550/arXiv.2210.11795","DOIUrl":"https://doi.org/10.48550/arXiv.2210.11795","url":null,"abstract":"Natural language is leveraged in many computer vision tasks such as image captioning, cross-modal retrieval or visual question answering, to provide fine-grained semantic information. While human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. In this work, we introduce the PoseScript dataset, which pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. To increase the size of this dataset to a scale compatible with typical data hungry learning algorithms, we propose an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information -- the posecodes -- using a set of simple but generic rules on the 3D keypoints. The posecodes are then combined into higher level textual descriptions using syntactic rules. Automatic annotations substantially increase the amount of available data, and make it possible to effectively pretrain deep models for finetuning on human captions. To demonstrate the potential of annotated poses, we show applications of the PoseScript dataset to retrieval of relevant poses from large-scale datasets and to synthetic pose generation, both based on a textual pose description.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"9 1","pages":"346-362"},"PeriodicalIF":0.0,"publicationDate":"2022-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76715550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Distilling the Undistillable: Learning from a Nasty Teacher
Surgan Jandial, Yash Khasbage, Arghya Pal, V. Balasubramanian, Balaji Krishnamurthy
The inadvertent stealing of private/sensitive information using Knowledge Distillation (KD) has been getting significant attention recently and has guided subsequent defense efforts considering its critical nature. The recent work Nasty Teacher proposed to develop teachers that cannot be distilled or imitated by models attacking them. However, the promise of confidentiality offered by a nasty teacher is not well studied, and as a further step toward strengthening against such loopholes, we attempt to bypass its defense and successfully steal (or extract) information in its presence. Specifically, we analyze Nasty Teacher from two different directions and subsequently leverage them carefully to develop simple yet efficient methodologies, named HTC and SCM, which increase the learning from Nasty Teacher by up to 68.63% on standard datasets. Additionally, we also explore an improvised defense method based on our insights into stealing. Our detailed set of experiments and ablations on diverse models/settings demonstrates the efficacy of our approach.
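For context, the kind of student learning that a nasty teacher is designed to break is the standard knowledge-distillation objective; the sketch below shows that generic KD loss (not the paper's HTC or SCM methods), with assumed temperature and weighting values.

```python
import torch
import torch.nn.functional as F

# Standard knowledge-distillation loss (soft teacher targets plus hard labels),
# i.e. the kind of learning a "nasty" teacher tries to prevent. This is generic
# KD, not the paper's HTC or SCM procedures; temperature and alpha are assumed.

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    """Blend soft teacher targets with the usual hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

if __name__ == "__main__":
    s = torch.randn(8, 10, requires_grad=True)   # student logits
    t = torch.randn(8, 10)                       # teacher logits
    y = torch.randint(0, 10, (8,))
    loss = kd_loss(s, t, y)
    loss.backward()
    print(float(loss))
```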
{"title":"Distilling the Undistillable: Learning from a Nasty Teacher","authors":"Surgan Jandial, Yash Khasbage, Arghya Pal, V. Balasubramanian, Balaji Krishnamurthy","doi":"10.48550/arXiv.2210.11728","DOIUrl":"https://doi.org/10.48550/arXiv.2210.11728","url":null,"abstract":"The inadvertent stealing of private/sensitive information using Knowledge Distillation (KD) has been getting significant attention recently and has guided subsequent defense efforts considering its critical nature. Recent work Nasty Teacher proposed to develop teachers which can not be distilled or imitated by models attacking it. However, the promise of confidentiality offered by a nasty teacher is not well studied, and as a further step to strengthen against such loopholes, we attempt to bypass its defense and steal (or extract) information in its presence successfully. Specifically, we analyze Nasty Teacher from two different directions and subsequently leverage them carefully to develop simple yet efficient methodologies, named as HTC and SCM, which increase the learning from Nasty Teacher by upto 68.63% on standard datasets. Additionally, we also explore an improvised defense method based on our insights of stealing. Our detailed set of experiments and ablations on diverse models/settings demonstrate the efficacy of our approach.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"60 1","pages":"587-603"},"PeriodicalIF":0.0,"publicationDate":"2022-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90853053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
GraphCSPN: Geometry-Aware Depth Completion via Dynamic GCNs
Xin Liu, Xiaofei Shao, Boqian Wang, Yali Li, Shengjin Wang
Image-guided depth completion aims to recover per-pixel dense depth maps from sparse depth measurements with the help of aligned color images, and has a wide range of applications from robotics to autonomous driving. However, the 3D nature of sparse-to-dense depth completion has not been fully explored by previous methods. In this work, we propose a Graph Convolution based Spatial Propagation Network (GraphCSPN) as a general approach for depth completion. First, unlike previous methods, we leverage convolutional neural networks as well as graph neural networks in a complementary way for geometric representation learning. In addition, the proposed networks explicitly incorporate learnable geometric constraints to regularize the propagation process, which is performed in three-dimensional space rather than in a two-dimensional plane. Furthermore, we construct the graph from sequences of feature patches and update it dynamically with an edge attention module during propagation, so as to better capture both local neighboring features and global relationships over long distances. Extensive experiments on both the indoor NYU-Depth-v2 and outdoor KITTI datasets demonstrate that our method achieves state-of-the-art performance, especially when compared in the case of using only a few propagation steps. Code and models are available on the project page.
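As a rough illustration of propagation over a dynamically built graph of feature patches, the sketch below constructs a k-NN graph and performs one attention-weighted update; it is a loose analogy under assumed shapes and k, not the paper's GraphCSPN layer.

```python
import torch

# Illustrative sketch of one attention-weighted propagation step over a k-NN
# graph of patch features -- loosely in the spirit of dynamic graph propagation,
# not the paper's GraphCSPN layers. Feature sizes and k are assumed.

def knn_graph(feats: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k nearest neighbours (excluding self) for each node."""
    dists = torch.cdist(feats, feats)                      # (N, N)
    dists.fill_diagonal_(float("inf"))
    return dists.topk(k, largest=False).indices            # (N, k)

def propagate(feats: torch.Tensor, values: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Update each node's value with an attention-weighted mean of its neighbours."""
    idx = knn_graph(feats, k)                               # (N, k)
    neigh_feats = feats[idx]                                # (N, k, C)
    # Edge attention from feature similarity between a node and its neighbours.
    att = torch.softmax((neigh_feats * feats.unsqueeze(1)).sum(-1), dim=-1)  # (N, k)
    neigh_vals = values[idx]                                # (N, k)
    return (att * neigh_vals).sum(-1)                       # (N,)

if __name__ == "__main__":
    N, C = 16, 8
    feats = torch.randn(N, C)
    depth = torch.rand(N)                       # e.g. coarse per-patch depth values
    print(propagate(feats, depth, k=4).shape)   # torch.Size([16])
```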
{"title":"GraphCSPN: Geometry-Aware Depth Completion via Dynamic GCNs","authors":"Xin Liu, Xiaofei Shao, Boqian Wang, Yali Li, Shengjin Wang","doi":"10.48550/arXiv.2210.10758","DOIUrl":"https://doi.org/10.48550/arXiv.2210.10758","url":null,"abstract":"Image guided depth completion aims to recover per-pixel dense depth maps from sparse depth measurements with the help of aligned color images, which has a wide range of applications from robotics to autonomous driving. However, the 3D nature of sparse-to-dense depth completion has not been fully explored by previous methods. In this work, we propose a Graph Convolution based Spatial Propagation Network (GraphCSPN) as a general approach for depth completion. First, unlike previous methods, we leverage convolution neural networks as well as graph neural networks in a complementary way for geometric representation learning. In addition, the proposed networks explicitly incorporate learnable geometric constraints to regularize the propagation process performed in three-dimensional space rather than in two-dimensional plane. Furthermore, we construct the graph utilizing sequences of feature patches, and update it dynamically with an edge attention module during propagation, so as to better capture both the local neighboring features and global relationships over long distance. Extensive experiments on both indoor NYU-Depth-v2 and outdoor KITTI datasets demonstrate that our method achieves the state-of-the-art performance, especially when compared in the case of using only a few propagation steps. Code and models are available at the project page.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"55 1","pages":"90-107"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78228323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
LaMAR: Benchmarking Localization and Mapping for Augmented Reality
Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L. Schönberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, O. Mikšík, M. Pollefeys
Localization and mapping is the foundational technology for augmented reality (AR) that enables sharing and persistence of digital content in the real world. While significant progress has been made, researchers are still mostly driven by unrealistic benchmarks not representative of real-world AR scenarios. These benchmarks are often based on small-scale datasets with low scene diversity, captured from stationary cameras, and lack other sensor inputs like inertial, radio, or depth data. Furthermore, their ground-truth (GT) accuracy is mostly insufficient to satisfy AR requirements. To close this gap, we introduce LaMAR, a new benchmark with a comprehensive capture and GT pipeline that co-registers realistic trajectories and sensor streams captured by heterogeneous AR devices in large, unconstrained scenes. To establish an accurate GT, our pipeline robustly aligns the trajectories against laser scans in a fully automated manner. As a result, we publish a benchmark dataset of diverse and large-scale scenes recorded with head-mounted and hand-held AR devices. We extend several state-of-the-art methods to take advantage of the AR-specific setup and evaluate them on our benchmark. The results offer new insights on current research and reveal promising avenues for future work in the field of localization and mapping for AR.
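To illustrate what aligning an estimated trajectory to a reference such as a laser scan involves at its simplest, here is a least-squares rigid alignment (Kabsch/Umeyama without scale) over given correspondences; the benchmark's actual ground-truth pipeline is far more involved, so treat this only as a toy sketch.

```python
import numpy as np

# Minimal least-squares rigid alignment between corresponding 3D points -- a toy
# illustration of registering an estimated trajectory to a reference one.
# This is not the benchmark's GT pipeline; correspondences are assumed given.

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Return R (3x3) and t (3,) minimizing sum ||R @ src_i + t - dst_i||^2."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(100, 3))
    theta = 0.3
    R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0, 0.0, 1.0]])
    dst = src @ R_true.T + np.array([1.0, -2.0, 0.5])
    R, t = rigid_align(src, dst)
    print(np.allclose(R, R_true, atol=1e-6), np.allclose(t, [1.0, -2.0, 0.5], atol=1e-6))
```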
{"title":"LaMAR: Benchmarking Localization and Mapping for Augmented Reality","authors":"Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L. Schönberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, O. Mikšík, M. Pollefeys","doi":"10.48550/arXiv.2210.10770","DOIUrl":"https://doi.org/10.48550/arXiv.2210.10770","url":null,"abstract":"Localization and mapping is the foundational technology for augmented reality (AR) that enables sharing and persistence of digital content in the real world. While significant progress has been made, researchers are still mostly driven by unrealistic benchmarks not representative of real-world AR scenarios. These benchmarks are often based on small-scale datasets with low scene diversity, captured from stationary cameras, and lack other sensor inputs like inertial, radio, or depth data. Furthermore, their ground-truth (GT) accuracy is mostly insufficient to satisfy AR requirements. To close this gap, we introduce LaMAR, a new benchmark with a comprehensive capture and GT pipeline that co-registers realistic trajectories and sensor streams captured by heterogeneous AR devices in large, unconstrained scenes. To establish an accurate GT, our pipeline robustly aligns the trajectories against laser scans in a fully automated manner. As a result, we publish a benchmark dataset of diverse and large-scale scenes recorded with head-mounted and hand-held AR devices. We extend several state-of-the-art methods to take advantage of the AR-specific setup and evaluate them on our benchmark. The results offer new insights on current research and reveal promising avenues for future work in the field of localization and mapping for AR.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"7 1","pages":"686-704"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78596733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Attaining Class-level Forgetting in Pretrained Model using Few Samples
Pravendra Singh, Pratik Mazumder, M. A. Karim
In order to address real-world problems, deep learning models are jointly trained on many classes. However, in the future, some classes may become restricted due to privacy/ethical concerns, and the restricted class knowledge has to be removed from the models that have been trained on them. The available data may also be limited due to privacy/ethical concerns, and re-training the model will not be possible. We propose a novel approach to address this problem without affecting the model's prediction power for the remaining classes. Our approach identifies the model parameters that are highly relevant to the restricted classes and removes the knowledge regarding the restricted classes from them using the limited available training data. Our approach is significantly faster and performs similarly to the model re-trained on the complete data of the remaining classes.
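A toy sketch of the general idea, not the authors' method: score parameter relevance to the restricted class by accumulated gradient magnitude on a few of its samples, then dampen the top-scoring weights. The scoring rule, fraction, and zeroing step are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy sketch: rank parameters by gradient magnitude on restricted-class samples
# and zero the most relevant ones. Purely illustrative; not the paper's actual
# procedure or hyper-parameters.

def relevance_scores(model: nn.Module, loader, loss_fn) -> dict:
    """Accumulate |gradient| per parameter over the restricted-class batches."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.abs()
    return scores

def dampen_top_weights(model: nn.Module, scores: dict, frac: float = 0.01):
    """Zero the fraction of weights with the highest relevance scores."""
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(frac * flat.numel()))
    thresh = flat.topk(k).values.min()
    with torch.no_grad():
        for n, p in model.named_parameters():
            p.masked_fill_(scores[n] >= thresh, 0.0)

if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    # A handful of samples from the (hypothetical) restricted class with label 2.
    data = [(torch.randn(8, 16), torch.full((8,), 2, dtype=torch.long)) for _ in range(3)]
    s = relevance_scores(model, data, nn.CrossEntropyLoss())
    dampen_top_weights(model, s, frac=0.01)
```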
{"title":"Attaining Class-level Forgetting in Pretrained Model using Few Samples","authors":"Pravendra Singh, Pratik Mazumder, M. A. Karim","doi":"10.48550/arXiv.2210.10670","DOIUrl":"https://doi.org/10.48550/arXiv.2210.10670","url":null,"abstract":"In order to address real-world problems, deep learning models are jointly trained on many classes. However, in the future, some classes may become restricted due to privacy/ethical concerns, and the restricted class knowledge has to be removed from the models that have been trained on them. The available data may also be limited due to privacy/ethical concerns, and re-training the model will not be possible. We propose a novel approach to address this problem without affecting the model's prediction power for the remaining classes. Our approach identifies the model parameters that are highly relevant to the restricted classes and removes the knowledge regarding the restricted classes from them using the limited available training data. Our approach is significantly faster and performs similar to the model re-trained on the complete data of the remaining classes.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"46 1","pages":"433-448"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89206563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Scaling Adversarial Training to Large Perturbation Bounds
Sravanti Addepalli, Samyak Jain, Gaurang Sriramanan, R. Venkatesh Babu
The vulnerability of Deep Neural Networks to Adversarial Attacks has fuelled research towards building robust models. While most Adversarial Training algorithms aim at defending against attacks constrained within low-magnitude Lp-norm bounds, real-world adversaries are not limited by such constraints. In this work, we aim to achieve adversarial robustness within larger bounds, against perturbations that may be perceptible but do not change human (or Oracle) prediction. The presence of images that flip Oracle predictions and those that do not makes this a challenging setting for adversarial robustness. We discuss the ideal goals of an adversarial defense algorithm beyond perceptual limits, and further highlight the shortcomings of naively extending existing training algorithms to higher perturbation bounds. To overcome these shortcomings, we propose a novel defense, Oracle-Aligned Adversarial Training (OA-AT), to align the predictions of the network with those of an Oracle during adversarial training. The proposed approach achieves state-of-the-art performance at large epsilon bounds (such as an L-inf bound of 16/255 on CIFAR-10) while also outperforming existing defenses (AWP, TRADES, PGD-AT) at the standard bound (8/255).
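To make the epsilon bounds concrete, the sketch below implements a generic L-inf PGD attack, the standard inner maximization used by adversarial training methods; it is not OA-AT itself, and the step size and iteration count are assumed values.

```python
import torch
import torch.nn.functional as F

# Generic L-inf PGD attack, shown only to make the epsilon bounds concrete
# (8/255 standard, 16/255 "large"). Not the paper's OA-AT procedure; step size
# and number of steps are assumed values.

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Return an adversarial example within an L-inf ball of radius eps around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the L-inf ball and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

if __name__ == "__main__":
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x = torch.rand(4, 3, 32, 32)
    y = torch.randint(0, 10, (4,))
    x_adv = pgd_linf(model, x, y, eps=16 / 255)            # the larger bound studied here
    print((x_adv - x).abs().max().item() <= 16 / 255 + 1e-6)
```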
{"title":"Scaling Adversarial Training to Large Perturbation Bounds","authors":"Sravanti Addepalli, Samyak Jain, Gaurang Sriramanan, R. Venkatesh Babu","doi":"10.48550/arXiv.2210.09852","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09852","url":null,"abstract":"The vulnerability of Deep Neural Networks to Adversarial Attacks has fuelled research towards building robust models. While most Adversarial Training algorithms aim at defending attacks constrained within low magnitude Lp norm bounds, real-world adversaries are not limited by such constraints. In this work, we aim to achieve adversarial robustness within larger bounds, against perturbations that may be perceptible, but do not change human (or Oracle) prediction. The presence of images that flip Oracle predictions and those that do not makes this a challenging setting for adversarial robustness. We discuss the ideal goals of an adversarial defense algorithm beyond perceptual limits, and further highlight the shortcomings of naively extending existing training algorithms to higher perturbation bounds. In order to overcome these shortcomings, we propose a novel defense, Oracle-Aligned Adversarial Training (OA-AT), to align the predictions of the network with that of an Oracle during adversarial training. The proposed approach achieves state-of-the-art performance at large epsilon bounds (such as an L-inf bound of 16/255 on CIFAR-10) while outperforming existing defenses (AWP, TRADES, PGD-AT) at standard bounds (8/255) as well.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"43 1","pages":"301-316"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85435781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
ARAH: Animatable Volume Rendering of Articulated Human SDFs
Shaofei Wang, Katja Schwarz, Andreas Geiger, Siyu Tang
Combining human body models with differentiable rendering has recently enabled animatable avatars of clothed humans from sparse sets of multi-view RGB videos. While state-of-the-art approaches achieve realistic appearance with neural radiance fields (NeRF), the inferred geometry often lacks detail due to missing geometric constraints. Further, animating avatars in out-of-distribution poses is not yet possible because the mapping from observation space to canonical space does not generalize faithfully to unseen poses. In this work, we address these shortcomings and propose a model to create animatable clothed human avatars with detailed geometry that generalize well to out-of-distribution poses. To achieve detailed geometry, we combine an articulated implicit surface representation with volume rendering. For generalization, we propose a novel joint root-finding algorithm for simultaneous ray-surface intersection search and correspondence search. Our algorithm enables efficient point sampling and accurate point canonicalization while generalizing well to unseen poses. We demonstrate that our proposed pipeline can generate clothed avatars with high-quality pose-dependent geometry and appearance from a sparse set of multi-view RGB videos. Our method achieves state-of-the-art performance on geometry and appearance reconstruction while creating animatable avatars that generalize well to out-of-distribution poses beyond the small number of training poses.
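As background for the ray-surface intersection that the joint root-finding addresses, here is plain sphere tracing of a ray against a toy SDF (a unit sphere); the paper's algorithm additionally handles correspondence search for articulated SDFs, which this sketch does not attempt.

```python
import torch

# Plain sphere tracing of rays against a signed distance field -- a baseline
# illustration of ray/SDF-surface intersection. The unit-sphere SDF, step
# count, and tolerance are assumptions for illustration only.

def sphere_sdf(p: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    return p.norm(dim=-1) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4):
    """March along each ray by the SDF value until the surface (|sdf| < eps) is reached."""
    t = torch.zeros(origin.shape[:-1])
    for _ in range(max_steps):
        p = origin + t.unsqueeze(-1) * direction
        d = sdf(p)
        t = t + d
        if (d.abs() < eps).all():
            break
    return origin + t.unsqueeze(-1) * direction

if __name__ == "__main__":
    origin = torch.tensor([[0.0, 0.0, -3.0]])
    direction = torch.tensor([[0.0, 0.0, 1.0]])     # unit-length ray direction
    hit = sphere_trace(origin, direction, sphere_sdf)
    print(hit)  # approximately [0, 0, -1], a point on the sphere surface
```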
{"title":"ARAH: Animatable Volume Rendering of Articulated Human SDFs","authors":"Shaofei Wang, Katja Schwarz, Andreas Geiger, Siyu Tang","doi":"10.48550/arXiv.2210.10036","DOIUrl":"https://doi.org/10.48550/arXiv.2210.10036","url":null,"abstract":"Combining human body models with differentiable rendering has recently enabled animatable avatars of clothed humans from sparse sets of multi-view RGB videos. While state-of-the-art approaches achieve realistic appearance with neural radiance fields (NeRF), the inferred geometry often lacks detail due to missing geometric constraints. Further, animating avatars in out-of-distribution poses is not yet possible because the mapping from observation space to canonical space does not generalize faithfully to unseen poses. In this work, we address these shortcomings and propose a model to create animatable clothed human avatars with detailed geometry that generalize well to out-of-distribution poses. To achieve detailed geometry, we combine an articulated implicit surface representation with volume rendering. For generalization, we propose a novel joint root-finding algorithm for simultaneous ray-surface intersection search and correspondence search. Our algorithm enables efficient point sampling and accurate point canonicalization while generalizing well to unseen poses. We demonstrate that our proposed pipeline can generate clothed avatars with high-quality pose-dependent geometry and appearance from a sparse set of multi-view RGB videos. Our method achieves state-of-the-art performance on geometry and appearance reconstruction while creating animatable avatars that generalize well to out-of-distribution poses beyond the small number of training poses.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"35 1","pages":"1-19"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86554090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 55
Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection
Xin Li, Botian Shi, Yuenan Hou, Xingjiao Wu, Tianlong Ma, Yikang Li, Liangbo He
Multi-modal 3D object detection has been an active research topic in autonomous driving. Nevertheless, it is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels. Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels. These fusion approaches often suffer from severe information loss, thus causing sub-optimal performance. To address these problems, we construct a homogeneous structure between the point cloud and images to avoid projective information loss by transforming the camera features into the LiDAR 3D space. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) to lift 2D image features into the 3D space and generate homogeneous image voxel features. Then, we fuse the voxelized point cloud features with the image features from different regions by introducing the self-attention based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) to enforce the consistency of semantic information from identical objects in the homogeneous point cloud and image voxel representations, which can provide object-level alignment guidance for cross-modal feature fusion and strengthen the discriminative ability in complex backgrounds. We conduct extensive experiments on the KITTI and Waymo Open Dataset, and the proposed HMFI achieves better performance compared with the state-of-the-art multi-modal methods. Particularly, for the 3D detection of cyclists on the KITTI benchmark, HMFI surpasses all published algorithms by a large margin.
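A toy sketch of the general image-to-voxel lifting idea: project voxel centres into the image with a pinhole camera and bilinearly sample the feature map. The intrinsics, grid, and sampling choices are assumptions for illustration, not the paper's IVLM.

```python
import torch
import torch.nn.functional as F

# Toy sketch of "lifting" 2D image features into 3D: project voxel centres into
# the image with pinhole intrinsics and bilinearly sample the feature map.
# Illustrates the general idea only; the camera and grid below are assumptions.

def lift_image_to_voxels(feat, K, voxel_centers):
    """feat: (C, H, W) image features; K: (3, 3) intrinsics; voxel_centers: (N, 3) in camera frame."""
    C, H, W = feat.shape
    proj = (K @ voxel_centers.T).T                     # (N, 3) homogeneous pixel coords
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)    # (N, 2) pixel coordinates (u, v)
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1, 2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)
    sampled = F.grid_sample(feat.unsqueeze(0), grid, align_corners=True)  # (1, C, 1, N)
    return sampled.squeeze(0).squeeze(1).T             # (N, C) per-voxel image features

if __name__ == "__main__":
    feat = torch.randn(16, 48, 64)                     # C=16 feature map
    K = torch.tensor([[50.0, 0.0, 32.0], [0.0, 50.0, 24.0], [0.0, 0.0, 1.0]])
    voxels = torch.rand(100, 3) * torch.tensor([2.0, 2.0, 5.0]) + torch.tensor([-1.0, -1.0, 1.0])
    print(lift_image_to_voxels(feat, K, voxels).shape)  # torch.Size([100, 16])
```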
{"title":"Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection","authors":"Xin Li, Botian Shi, Yuenan Hou, Xingjiao Wu, Tianlong Ma, Yikang Li, Liangbo He","doi":"10.48550/arXiv.2210.09615","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09615","url":null,"abstract":"Multi-modal 3D object detection has been an active research topic in autonomous driving. Nevertheless, it is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels. Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels. These fusion approaches often suffer from severe information loss, thus causing sub-optimal performance. To address these problems, we construct the homogeneous structure between the point cloud and images to avoid projective information loss by transforming the camera features into the LiDAR 3D space. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) to lift 2D image features into the 3D space and generate homogeneous image voxel features. Then, we fuse the voxelized point cloud features with the image features from different regions by introducing the self-attention based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) to enforce the consistency of semantic information from identical objects in the homogeneous point cloud and image voxel representations, which can provide object-level alignment guidance for cross-modal feature fusion and strengthen the discriminative ability in complex backgrounds. We conduct extensive experiments on the KITTI and Waymo Open Dataset, and the proposed HMFI achieves better performance compared with the state-of-the-art multi-modal methods. Particularly, for the 3D detection of cyclist on the KITTI benchmark, HMFI surpasses all the published algorithms by a large margin.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"382 1","pages":"691-707"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84958408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Towards Efficient and Effective Self-Supervised Learning of Visual Representations
Sravanti Addepalli, K. Bhogale, P. Dey, R. Venkatesh Babu
Self-supervision has emerged as a propitious method for visual representation learning after the recent paradigm shift from handcrafted pretext tasks to instance-similarity based approaches. Most state-of-the-art methods enforce similarity between various augmentations of a given image, while some methods additionally use contrastive approaches to explicitly ensure diverse representations. While these approaches have indeed shown promising direction, they require a significantly larger number of training iterations when compared to the supervised counterparts. In this work, we explore reasons for the slow convergence of these methods, and further propose to strengthen them using well-posed auxiliary tasks that converge significantly faster, and are also useful for representation learning. The proposed method utilizes the task of rotation prediction to improve the efficiency of existing state-of-the-art methods. We demonstrate significant gains in performance using the proposed method on multiple datasets, specifically for lower training epochs.
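The rotation-prediction pretext task the method builds on can be sketched in a few lines: rotate each image by 0/90/180/270 degrees and train a small head to classify the rotation. The backbone and head below are placeholder assumptions, not the paper's training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the rotation-prediction auxiliary task: each image is
# rotated by 0/90/180/270 degrees and a small head predicts which rotation was
# applied. Generic pretext task only; backbone and head are placeholders.

def make_rotation_batch(x: torch.Tensor):
    """Return the 4 rotated copies of each image and the rotation labels (0-3)."""
    rots = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rots, dim=0), labels

if __name__ == "__main__":
    backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten())
    head = nn.Linear(8, 4)                      # 4-way rotation classifier
    x = torch.rand(16, 3, 32, 32)
    imgs, labels = make_rotation_batch(x)
    loss = F.cross_entropy(head(backbone(imgs)), labels)
    loss.backward()
    print(float(loss))
```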
{"title":"Towards Efficient and Effective Self-Supervised Learning of Visual Representations","authors":"Sravanti Addepalli, K. Bhogale, P. Dey, R. Venkatesh Babu","doi":"10.48550/arXiv.2210.09866","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09866","url":null,"abstract":"Self-supervision has emerged as a propitious method for visual representation learning after the recent paradigm shift from handcrafted pretext tasks to instance-similarity based approaches. Most state-of-the-art methods enforce similarity between various augmentations of a given image, while some methods additionally use contrastive approaches to explicitly ensure diverse representations. While these approaches have indeed shown promising direction, they require a significantly larger number of training iterations when compared to the supervised counterparts. In this work, we explore reasons for the slow convergence of these methods, and further propose to strengthen them using well-posed auxiliary tasks that converge significantly faster, and are also useful for representation learning. The proposed method utilizes the task of rotation prediction to improve the efficiency of existing state-of-the-art methods. We demonstrate significant gains in performance using the proposed method on multiple datasets, specifically for lower training epochs.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"43 1","pages":"523-538"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85405996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4