
Latest publications from the 2021 IEEE/CVF International Conference on Computer Vision (ICCV)

SurfGen: Adversarial 3D Shape Synthesis with Explicit Surface Discriminators
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01593
Andrew Luo, Tianqin Li, Wenhao Zhang, T. Lee
Recent advances in deep generative models have led to immense progress in 3D shape synthesis. While existing models are able to synthesize shapes represented as voxels, point-clouds, or implicit functions, these methods only indirectly enforce the plausibility of the final 3D shape surface. Here we present a 3D shape synthesis framework (SurfGen) that directly applies adversarial training to the object surface. Our approach uses a differentiable spherical projection layer to capture and represent the explicit zero isosurface of an implicit 3D generator as functions defined on the unit sphere. By processing the spherical representation of 3D object surfaces with a spherical CNN in an adversarial setting, our generator can better learn the statistics of natural shape surfaces. We evaluate our model on large-scale shape datasets, and demonstrate that the end-to-end trained model is capable of generating high fidelity 3D shapes with diverse topology.
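To make the spherical projection step concrete, the sketch below maps the (approximate) zero isosurface of an implicit function onto the unit sphere as a per-direction radius map: rays are sampled over the sphere and a soft argmin over |f| along each ray picks the surface radius. The soft selection, the resolution constants, and the helper name spherical_projection are illustrative assumptions, not the paper's implementation; the resulting radius map is the kind of spherical signal a spherical CNN discriminator would consume.

```python
import math
import torch

def spherical_projection(implicit_fn, n_el=32, n_az=64, n_r=128, r_max=1.0, tau=50.0):
    """Softly locate the zero level set of `implicit_fn` along rays from the origin,
    returning an (n_el, n_az) radius map defined on the unit sphere."""
    el = torch.linspace(1e-3, math.pi - 1e-3, n_el)[:, None].expand(n_el, n_az)
    az = torch.linspace(0.0, 2.0 * math.pi, n_az)[None, :].expand(n_el, n_az)
    dirs = torch.stack([torch.sin(el) * torch.cos(az),
                        torch.sin(el) * torch.sin(az),
                        torch.cos(el)], dim=-1)                    # (n_el, n_az, 3)
    radii = torch.linspace(1e-2, r_max, n_r)                       # samples along each ray
    pts = dirs[..., None, :] * radii[:, None]                      # (n_el, n_az, n_r, 3)
    vals = implicit_fn(pts.reshape(-1, 3)).reshape(n_el, n_az, n_r)
    w = torch.softmax(-tau * vals.abs(), dim=-1)                   # soft argmin of |f| per ray
    return (w * radii).sum(dim=-1)                                 # (n_el, n_az) radius map

# Toy usage: an implicit sphere of radius 0.5 projects to a roughly constant radius map.
sdf = lambda p: p.norm(dim=-1) - 0.5
radius_map = spherical_projection(sdf)
```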
Citations: 25
N-ImageNet: Towards Robust, Fine-Grained Object Recognition with Event Cameras
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.00215
Junho Kim, Jaehyeok Bae, Gang-Ryeong Park, Y. Kim
We introduce N-ImageNet, a large-scale dataset targeted for robust, fine-grained object recognition with event cameras. The dataset is collected using programmable hardware in which an event camera consistently moves around a monitor displaying images from ImageNet. N-ImageNet serves as a challenging benchmark for event-based object recognition, due to its large number of classes and samples. We empirically show that pretraining on N-ImageNet improves the performance of event-based classifiers and helps them learn with few labeled data. In addition, we present several variants of N-ImageNet to test the robustness of event-based classifiers under diverse camera trajectories and severe lighting conditions, and propose a novel event representation to alleviate the performance degradation. To the best of our knowledge, we are the first to quantitatively investigate the consequences caused by various environmental conditions on event-based object recognition algorithms. N-ImageNet and its variants are expected to guide practical implementations for deploying event-based object recognition algorithms in the real world.
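For readers unfamiliar with event data, the sketch below shows one common way to turn a raw event stream into something a standard classifier can consume: a two-channel count image split by polarity. This is a generic representation given only for context; it is not the novel event representation proposed in the paper.

```python
import numpy as np

def event_histogram(events, height, width):
    """Accumulate events (x, y, t, polarity) into a 2-channel count image:
    channel 0 counts negative-polarity events, channel 1 positive-polarity events."""
    hist = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pol = (events[:, 3] > 0).astype(int)
    np.add.at(hist, (pol, y, x), 1.0)   # unbuffered accumulation handles repeated pixels
    return hist

# Toy usage with a random stream on a 480x640 sensor.
events = np.stack([np.random.randint(0, 640, 1000),
                   np.random.randint(0, 480, 1000),
                   np.sort(np.random.rand(1000)),
                   np.random.choice([-1, 1], 1000)], axis=1)
frame = event_histogram(events, height=480, width=640)   # shape (2, 480, 640)
```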
Citations: 34
Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection
Pub Date : 2021-10-01 DOI: 10.1109/iccv48922.2021.01561
Markos Diomataris, N. Gkanatsios, Vassilis Pitsikalis, P. Maragos
Scene Graph Generators (SGGs) are models that, given an image, build a directed graph where each edge represents a predicted subject-predicate-object triplet. Most SGGs silently exploit datasets' bias on relationships' context, i.e. their subject and object, to improve recall and neglect spatial and visual evidence, e.g. having seen a glut of data for person wearing shirt, they are overconfident that every person is wearing every shirt. Such imprecise predictions are mainly ascribed to the lack of negative examples for most relationships, which obstructs models from meaningfully learning predicates, even those that have ample positive examples. We first present an in-depth investigation of the context bias issue to showcase that all examined state-of-the-art SGGs share the above vulnerabilities. In response, we propose a semi-supervised scheme that forces predicted triplets to be grounded consistently back to the image, in a closed-loop manner. The developed spatial common sense can then be distilled to a student SGG and substantially enhance its spatial reasoning ability. This Grounding Consistency Distillation (GCD) approach is model-agnostic and benefits from the superfluous unlabeled samples to retain the valuable context information and avert memorization of annotations. Furthermore, we demonstrate that current metrics disregard unlabeled samples, rendering themselves incapable of reflecting context bias, so we mine hard negatives and incorporate them during evaluation to reformulate precision as a reliable metric. Extensive experimental comparisons exhibit large quantitative improvements (up to a 70% relative precision boost on the VG200 dataset) as well as qualitative ones, proving the significance of our GCD method and our metrics towards refocusing graph generation as a core aspect of scene understanding. Code available at https://github.com/deeplab-ai/grounding-consistent-vrd.
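At its most generic level, the distillation step (teacher soft targets supervising a student's predicate scores) reduces to a standard soft-label distillation loss, sketched below. The grounding-consistency machinery that produces the teacher is not shown; this function is a hedged stand-in rather than the paper's GCD objective.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation(student_logits, teacher_logits, T=2.0):
    """Hinton-style distillation: KL divergence between temperature-softened
    teacher and student predicate distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Toy usage: 8 relationship proposals, 50 predicate classes.
loss = soft_label_distillation(torch.randn(8, 50), torch.randn(8, 50))
```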
Citations: 6
Learning Latent Architectural Distribution in Differentiable Neural Architecture Search via Variational Information Maximization
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01209
Yaoming Wang, Yuchen Liu, Wenrui Dai, Chenglin Li, Junni Zou, H. Xiong
Existing differentiable neural architecture search approaches simply assume that the architectural distributions on different edges are independent of each other, which conflicts with the intrinsic properties of architectures. In this paper, we view the architectural distribution as the latent representation of specific data points. We then propose Variational Information Maximization Neural Architecture Search (VIM-NAS), which leverages a simple yet effective convolutional neural network to model the latent representation and optimizes a tractable variational lower bound on the mutual information between the data points and the latent representations. VIM-NAS automatically learns a nearly one-hot distribution from a continuous distribution with extremely fast convergence, e.g., converging within a single epoch. Experimental results demonstrate that VIM-NAS achieves state-of-the-art performance on various search spaces, including the DARTS search space, NAS-Bench-1shot1, NAS-Bench-201, and the simplified search spaces S1-S4. Specifically, VIM-NAS achieves top-1 error rates of 2.45% and 15.80% within 10 minutes on CIFAR-10 and CIFAR-100, respectively, and a top-1 error rate of 24.0% when transferred to ImageNet.
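For context, the tractable variational lower bound on mutual information referred to above is, in its standard Barber-Agakov form (the abstract does not spell out the exact bound used, so this generic version is an assumption):

$$I(X;Z) \;=\; H(Z) - H(Z \mid X) \;\ge\; H(Z) + \mathbb{E}_{p(x,z)}\big[\log q_\phi(z \mid x)\big],$$

which holds because $\mathrm{KL}\big(p(z\mid x)\,\|\,q_\phi(z\mid x)\big) \ge 0$. Maximizing the expected log-likelihood of the variational network $q_\phi(z \mid x)$ (here, the convolutional network modeling the latent architectural representation) therefore raises the lower bound on $I(X;Z)$.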
Citations: 5
Visual Graph Memory with Unsupervised Representation for Visual Navigation
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01559
Obin Kwon, Nuri Kim, Yunho Choi, Hwiyeon Yoo, Jeongho Park, Songhwai Oh
We present a novel graph-structured memory for visual navigation, called visual graph memory (VGM), which consists of unsupervised image representations obtained from navigation history. The proposed VGM is constructed incrementally based on the similarities among the unsupervised representations of observed images, and these representations are learned from an unlabeled image dataset. We also propose a navigation framework that can utilize the proposed VGM to tackle visual navigation problems. By incorporating a graph convolutional network and the attention mechanism, the proposed agent refers to the VGM to navigate the environment while simultaneously building the VGM. Using the VGM, the agent can embed its navigation history and other useful task-related information. We validate our approach on the visual navigation tasks using the Habitat simulator with the Gibson dataset, which provides a photo-realistic simulation environment. The extensive experimental results show that the proposed navigation agent with VGM surpasses the state-of-the-art approaches on image-goal navigation tasks. Project Page: https://sites.google.com/view/iccv2021vgm
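A minimal sketch of an incrementally built, similarity-keyed graph memory is given below; the cosine-similarity threshold rule, the temporal-adjacency edge policy, and the class name VisualGraphMemory are assumptions for illustration and may differ from the paper's construction.

```python
import torch

class VisualGraphMemory:
    """Grow a graph of observation embeddings: reuse a node when the current
    embedding is similar enough to an existing one, otherwise add a new node,
    and connect temporally consecutive nodes with an edge."""
    def __init__(self, sim_threshold=0.9):
        self.nodes = []            # list of (D,) embeddings
        self.edges = set()         # undirected edges as sorted index pairs
        self.sim_threshold = sim_threshold
        self.last_node = None

    def update(self, embedding):
        if self.nodes:
            sims = torch.stack([torch.cosine_similarity(embedding, n, dim=0)
                                for n in self.nodes])
            best = int(sims.argmax())
            if float(sims[best]) >= self.sim_threshold:
                current = best                        # localize to an existing node
            else:
                self.nodes.append(embedding)
                current = len(self.nodes) - 1         # spawn a new node
        else:
            self.nodes.append(embedding)
            current = 0
        if self.last_node is not None and self.last_node != current:
            self.edges.add(tuple(sorted((self.last_node, current))))
        self.last_node = current
        return current

# Toy usage: feed a stream of 128-d observation embeddings.
memory = VisualGraphMemory()
for _ in range(10):
    memory.update(torch.nn.functional.normalize(torch.randn(128), dim=0))
```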
Citations: 27
BuildingNet: Learning to Label 3D Buildings
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01023
Pratheba Selvaraju, Mohamed Nabail, Marios Loizou, Maria I. Maslioukova, Melinos Averkiou, Andreas C. Andreou, S. Chaudhuri, E. Kalogerakis
We introduce BuildingNet: (a) a large-scale dataset of 3D building models whose exteriors are consistently labeled, and (b) a graph neural network that labels building meshes by analyzing spatial and structural relations of their geometric primitives. To create our dataset, we used crowdsourcing combined with expert guidance, resulting in 513K annotated mesh primitives, grouped into 292K semantic part components across 2K building models. The dataset covers several building categories, such as houses, churches, skyscrapers, town halls, libraries, and castles. We include a benchmark for evaluating mesh and point cloud labeling. Buildings have more challenging structural complexity compared to objects in existing benchmarks (e.g., ShapeNet, PartNet); thus, we hope that our dataset can nurture the development of algorithms that are able to cope with such large-scale geometric data for both vision and graphics tasks, e.g., 3D semantic segmentation, part-based generative models, correspondences, texturing, and analysis of point cloud data acquired from real-world buildings. Finally, we show that our mesh-based graph neural network significantly improves performance over several baselines for labeling 3D meshes. Our project page www.buildingnet.org includes our dataset and code.
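As a rough illustration of how mesh primitives could be turned into graph input for such a network, the sketch below builds a k-nearest-neighbour graph over primitive centroids. The paper's actual edges encode richer spatial and structural relations, so this is only an assumed starting point.

```python
import numpy as np

def knn_primitive_graph(centroids, k=6):
    """Connect each mesh-primitive centroid to its k nearest neighbours,
    returning a list of directed (i, j) edges."""
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # no self-edges
    nn_idx = np.argsort(d, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(centroids)) for j in nn_idx[i]]

# Toy usage: 100 primitives with random centroids.
edges = knn_primitive_graph(np.random.rand(100, 3), k=6)
```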
Citations: 16
Point Cloud Augmentation with Weighted Local Transformations
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.00059
S. Kim, S. Lee, Dasol Hwang, Jaewon Lee, Seong Jae Hwang, Hyunwoo J. Kim
Despite the extensive usage of point clouds in 3D vision, relatively limited data are available for training deep neural networks. Although data augmentation is a standard approach to compensate for the scarcity of data, it has been less explored in the point cloud literature. In this paper, we propose a simple and effective augmentation method called PointWOLF for point cloud augmentation. The proposed method produces smoothly varying non-rigid deformations by locally weighted transformations centered at multiple anchor points. The smooth deformations allow diverse and realistic augmentations. Furthermore, to minimize the manual effort of searching for optimal augmentation hyperparameters, we present AugTune, which generates augmented samples of desired difficulty, producing targeted confidence scores. Our experiments show that our framework consistently improves performance on both shape classification and part segmentation tasks. In particular, with PointNet++, PointWOLF achieves state-of-the-art accuracy of 89.7% for shape classification on the real-world ScanObjectNN dataset. The code is available at https://github.com/mlvlab/PointWOLF.
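The idea of locally weighted transformations can be sketched as follows: sample a few anchor points, draw a random transform per anchor, and blend the per-anchor transformed copies of the cloud with Gaussian distance weights, so the deformation varies smoothly over space. The kernel, the hyperparameters, and the helper name pointwolf_like_augment are illustrative assumptions; the released repository linked above contains the actual implementation.

```python
import numpy as np

def pointwolf_like_augment(points, n_anchors=4, sigma=0.5,
                           max_rot_deg=10.0, max_scale=0.1, max_shift=0.05):
    """Blend per-anchor random transforms of a point cloud with Gaussian weights.
    points: (N, 3) array with N >= n_anchors. Returns an augmented (N, 3) array."""
    N = points.shape[0]
    anchors = points[np.random.choice(N, n_anchors, replace=False)]       # (A, 3)
    out = np.zeros((n_anchors, N, 3))
    for a in range(n_anchors):
        ax, ay, az = np.deg2rad(np.random.uniform(-max_rot_deg, max_rot_deg, 3))
        Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
        Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
        Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
        scale = 1.0 + np.random.uniform(-max_scale, max_scale, 3)
        shift = np.random.uniform(-max_shift, max_shift, 3)
        centered = points - anchors[a]
        out[a] = (centered * scale) @ (Rz @ Ry @ Rx).T + anchors[a] + shift
    d2 = ((points[None, :, :] - anchors[:, None, :]) ** 2).sum(-1)         # (A, N)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w = w / (w.sum(axis=0, keepdims=True) + 1e-8)                          # per-point blend weights
    return (w[..., None] * out).sum(axis=0)

# Toy usage on a random cloud of 1024 points.
augmented = pointwolf_like_augment(np.random.rand(1024, 3))
```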
Citations: 31
Dynamic Cross Feature Fusion for Remote Sensing Pansharpening
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01442
Xiao Wu, Tingzhu Huang, Liang-Jian Deng, Tian-Jing Zhang
Deep convolutional neural networks have been adopted for pansharpening and have achieved state-of-the-art performance. However, most existing works focus on single-scale feature fusion, which fails to fully capture the relationships between high-level semantics and low-level features, even when the network is deep enough. In this paper, we propose a dynamic cross feature fusion network (DCFNet) for pansharpening. Specifically, DCFNet contains multiple parallel branches, including a high-resolution branch that serves as the backbone and low-resolution branches that are progressively merged into the backbone, so DCFNet can represent the overall information well. To strengthen the relationships between branches, dynamic cross feature transfers are embedded into the branches to obtain high-resolution representations, and contextualized features are then learned to improve information fusion. Experimental results indicate that DCFNet significantly outperforms prior art in both quantitative indicators and visual quality.
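The multi-resolution exchange pattern can be illustrated with a toy two-branch fusion module: the high-resolution branch receives upsampled low-resolution features and the low-resolution branch receives downsampled high-resolution features. This shows only the generic cross-branch transfer, not the paper's dynamic fusion mechanism, and it assumes the low-resolution branch sits at half the (even-sized) spatial resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCrossFusion(nn.Module):
    """Exchange features between a high-resolution and a half-resolution branch."""
    def __init__(self, ch_hi, ch_lo):
        super().__init__()
        self.lo_to_hi = nn.Conv2d(ch_lo, ch_hi, kernel_size=1)
        self.hi_to_lo = nn.Conv2d(ch_hi, ch_lo, kernel_size=3, stride=2, padding=1)

    def forward(self, feat_hi, feat_lo):
        up = F.interpolate(self.lo_to_hi(feat_lo), size=feat_hi.shape[-2:],
                           mode="bilinear", align_corners=False)
        return feat_hi + up, feat_lo + self.hi_to_lo(feat_hi)

# Toy usage: 64-channel full-resolution and 128-channel half-resolution features.
fuse = ToyCrossFusion(ch_hi=64, ch_lo=128)
hi, lo = fuse(torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32))
```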
Citations: 35
Bringing Events into Video Deblurring with Non-consecutively Blurry Frames
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.00449
Wei Shang, Dongwei Ren, Dongqing Zou, Jimmy S. J. Ren, Ping Luo, W. Zuo
Recently, video deblurring has attracted considerable research attention, and several works suggest that events captured at a high temporal rate can benefit deblurring. Existing video deblurring methods assume consecutively blurry frames, neglecting the fact that sharp frames usually appear near blurry frames. In this paper, we develop a principled framework, D2Nets, for video deblurring that exploits non-consecutively blurry frames, and propose a flexible event fusion module (EFM) to bridge the gap between event-driven and video deblurring. In D2Nets, we first detect nearest sharp frames (NSFs) using a bidirectional LSTM detector, and then perform deblurring guided by the NSFs. Furthermore, the EFM can be flexibly incorporated into D2Nets, where events are leveraged to notably boost deblurring performance. EFM can also be easily incorporated into existing deblurring networks, allowing event-driven deblurring to benefit from state-of-the-art deblurring methods. On synthetic and real-world blurry datasets, our methods achieve better results than competing methods, and EFM not only benefits D2Nets but also significantly improves the competing deblurring networks.
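The NSF detection idea, scoring each frame of a clip as sharp or blurry with a bidirectional LSTM over per-frame features, can be sketched as below; the feature extractor, layer sizes, and the thresholding used to pick nearest sharp frames are placeholders rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SharpFrameDetector(nn.Module):
    """Score each frame of a clip with a bidirectional LSTM over per-frame features."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, frame_feats):                  # (B, T, feat_dim)
        h, _ = self.lstm(frame_feats)                # (B, T, 2 * hidden)
        return self.head(h).squeeze(-1)              # (B, T) per-frame sharpness logits

# Toy usage: for a blurry frame t, its nearest sharp frames would be the closest
# indices before/after t whose sigmoid score exceeds a chosen threshold.
detector = SharpFrameDetector()
scores = torch.sigmoid(detector(torch.randn(2, 16, 256)))   # (2, 16)
```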
Citations: 30
Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01103
A. Sengupta, Ignas Budvytis, R. Cipolla
This paper addresses the problem of 3D human body shape and pose estimation from an RGB image. This is often an ill-posed problem, since multiple plausible 3D bodies may match the visual evidence present in the input - particularly when the subject is occluded. Thus, it is desirable to estimate a distribution over 3D body shape and pose conditioned on the input image instead of a single 3D reconstruction. We train a deep neural network to estimate a hierarchical matrix-Fisher distribution over relative 3D joint rotation matrices (i.e. body pose), which exploits the human body’s kinematic tree structure, as well as a Gaussian distribution over SMPL body shape parameters. To further ensure that the predicted shape and pose distributions match the visual evidence in the input image, we implement a differentiable rejection sampler to impose a reprojection loss between ground-truth 2D joint coordinates and samples from the predicted distributions, projected onto the image plane. We show that our method is competitive with the state-of-the-art in terms of 3D shape and pose metrics on the SSP-3D and 3DPW datasets, while also yielding a structured probability distribution over 3D body shape and pose, with which we can meaningfully quantify prediction uncertainty and sample multiple plausible 3D reconstructions to explain a given input image.
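For reference, the matrix-Fisher distribution named above is the standard density over rotations

$$p(R \mid F) \;=\; \frac{1}{c(F)}\,\exp\!\big(\operatorname{tr}(F^{\top} R)\big), \qquad R \in SO(3),\; F \in \mathbb{R}^{3\times 3},$$

where $F$ is the (per-joint) parameter matrix predicted by the network and the normalizer $c(F)$ depends only on the singular values of $F$; the mode follows from the proper SVD of $F$ and larger singular values give a more concentrated distribution. The hierarchical, kinematic-tree conditioning of these per-joint distributions is the paper's contribution on top of this standard form.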
Citations: 30