
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Latest Publications

Exploit Visual Dependency Relations for Semantic Segmentation
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00960
Mingyuan Liu, D. Schonfeld, Wei Tang
Dependency relations among visual entities are ubiquitous because both objects and scenes are highly structured. They provide prior knowledge about the real world that can help improve the generalization ability of deep learning approaches. Different from contextual reasoning, which focuses on feature aggregation in the spatial domain, visual dependency reasoning explicitly models the dependency relations among visual entities. In this paper, we introduce a novel network architecture, termed the dependency network or DependencyNet, for semantic segmentation. It unifies dependency reasoning at three semantic levels. Intra-class reasoning decouples the representations of different object categories and updates them separately based on the internal object structures. Inter-class reasoning then performs spatial and semantic reasoning based on the dependency relations among different object categories. We provide an in-depth investigation of how to discover the dependency graph from the training annotations. Global dependency reasoning further refines the representations of each object category based on the global scene information. Extensive ablation studies with a controlled model size and the same network depth show that each individual dependency reasoning component benefits semantic segmentation, and together they significantly improve the base network. Experimental results on two benchmark datasets show that the DependencyNet achieves performance comparable to the recent state of the art.
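As an illustration of the inter-class reasoning step, the hedged sketch below (not the authors' implementation; the co-occurrence graph, the single-layer residual update, and all tensor shapes are assumptions) builds a dependency graph from training annotations via class co-occurrence and performs one graph-style update of per-class representations.

```python
# Minimal sketch of inter-class dependency reasoning over a graph discovered
# from training annotations. Shapes and the update rule are illustrative only.
import numpy as np

def cooccurrence_graph(label_maps, num_classes):
    """Count how often two classes appear in the same annotated image."""
    A = np.zeros((num_classes, num_classes))
    for lab in label_maps:
        present = np.unique(lab)
        for i in present:
            for j in present:
                if i != j:
                    A[i, j] += 1
    # Row-normalize so each class aggregates a weighted average of its neighbors.
    row_sum = A.sum(axis=1, keepdims=True)
    return A / np.maximum(row_sum, 1e-6)

def inter_class_reasoning(class_feats, A, W):
    """One graph-convolution-style update of per-class features.
    class_feats: (num_classes, dim); A: (num_classes, num_classes); W: (dim, dim)."""
    messages = A @ class_feats                           # aggregate dependent classes
    return np.maximum(class_feats + messages @ W, 0.0)   # residual update + ReLU

# Toy usage: 3 classes, 8-dim features, two annotated label maps.
rng = np.random.default_rng(0)
labels = [rng.integers(0, 3, size=(4, 4)) for _ in range(2)]
A = cooccurrence_graph(labels, num_classes=3)
feats = rng.normal(size=(3, 8))
updated = inter_class_reasoning(feats, A, rng.normal(size=(8, 8)) * 0.1)
print(updated.shape)  # (3, 8)
```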
Citations: 17
Real-Time Sphere Sweeping Stereo from Multiview Fisheye Images
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01126
Andreas Meuleman, Hyeonjoong Jang, D. S. Jeon, Min H. Kim
A set of cameras with fisheye lenses have been used to capture a wide field of view. The traditional scan-line stereo algorithms based on epipolar geometry are directly inapplicable to this non-pinhole camera setup due to optical characteristics of fisheye lenses; hence, existing complete 360° RGB-D imaging systems have rarely achieved realtime performance yet. In this paper, we introduce an efficient sphere-sweeping stereo that can run directly on multiview fisheye images without requiring additional spherical rectification. Our main contributions are: First, we introduce an adaptive spherical matching method that accounts for each input fisheye camera’s resolving power concerning spherical distortion. Second, we propose a fast inter-scale bilateral cost volume filtering method that refines distance in noisy and textureless regions with optimal complexity of O(n). It enables real-time dense distance estimation while preserving edges. Lastly, the fisheye color and distance images are seamlessly combined into a complete 360° RGB-D image via fast inpainting of the dense distance map. We demonstrate an embedded 360° RGB-D imaging prototype composed of a mobile GPU and four fisheye cameras. Our prototype is capable of capturing complete 360° RGB-D videos with a resolution of two megapixels at 29 fps. Results demonstrate that our real-time method outperforms traditional omnidirectional stereo and learning-based omnidirectional stereo in terms of accuracy and performance.
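The sweep itself can be pictured with a toy sketch (fisheye warping is omitted and assumed precomputed; the variance-based photometric cost and all shapes are assumptions, not the paper's exact cost or filtering): for each candidate inverse distance, the views resampled onto a shared sphere are compared, and the best hypothesis per pixel is kept by winner-take-all.

```python
# Toy sphere-sweep core: cost volume over inverse-distance hypotheses + winner-take-all.
import numpy as np

def sweep_distance(warped_views, inv_distances):
    """warped_views: (num_hypotheses, num_cameras, H, W) grayscale views already
    resampled onto the shared sphere for each candidate inverse distance.
    Returns a per-pixel distance map chosen by winner-take-all."""
    cost = warped_views.var(axis=1)      # views agree where the hypothesis is right
    best = cost.argmin(axis=0)           # (H, W) index of the cheapest hypothesis
    return 1.0 / inv_distances[best]     # convert back to metric distance

# Toy usage: 8 inverse-distance hypotheses, 4 cameras, 16x32 spherical image.
rng = np.random.default_rng(1)
views = rng.random((8, 4, 16, 32))
inv_d = np.linspace(0.1, 2.0, 8)
distance_map = sweep_distance(views, inv_d)
print(distance_map.shape)  # (16, 32)
```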
Citations: 13
Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00579
Qi Feng, Vitaly Ablavsky, Qinxun Bai, S. Sclaroff
We propose a novel Siamese Natural Language Tracker (SNLT), which brings the advancements in visual tracking to the task of tracking by natural language (NL) descriptions. The proposed SNLT is applicable to a wide range of Siamese trackers, providing a new class of baselines for the tracking-by-NL task and promising future improvements as Siamese trackers advance. The carefully designed architecture of the Siamese Natural Language Region Proposal Network (SNL-RPN), together with the Dynamic Aggregation of vision and language modalities, is introduced to perform the tracking-by-NL task. Empirical results over tracking benchmarks with NL annotations show that the proposed SNLT improves Siamese trackers by 3 to 7 percentage points with a slight tradeoff in speed. The proposed SNLT outperforms all NL trackers to date and is competitive among state-of-the-art real-time trackers on LaSOT benchmarks while running at 50 frames per second on a single GPU. Code for this work is available at https://github.com/fredfung007/snlt.
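To make the Siamese matching plus Dynamic Aggregation idea concrete, here is a hedged toy sketch (the language-to-kernel projection, the two-way softmax weighting, and all shapes are assumptions, not the SNL-RPN architecture): a visual template and a language-derived kernel are each cross-correlated with the search features, and the two response maps are fused with learned weights.

```python
# Toy Siamese cross-correlation with a simple vision/language response fusion.
import torch
import torch.nn.functional as F

def siamese_response(search_feat, template_feat):
    """Cross-correlate a template over the search region (single sample)."""
    # search_feat: (C, Hs, Ws); template_feat: (C, Ht, Wt)
    return F.conv2d(search_feat.unsqueeze(0), template_feat.unsqueeze(0)).squeeze(0)

def dynamic_aggregation(vis_resp, lang_resp, weights):
    """Convex combination of vision- and language-driven response maps."""
    w = torch.softmax(weights, dim=0)
    return w[0] * vis_resp + w[1] * lang_resp

C, Hs, Ws, Ht, Wt = 16, 31, 31, 7, 7
search = torch.randn(C, Hs, Ws)
template = torch.randn(C, Ht, Wt)
lang_kernel = torch.randn(C, Ht, Wt)   # stand-in for a language embedding projected to kernel shape
vis = siamese_response(search, template)
lang = siamese_response(search, lang_kernel)
fused = dynamic_aggregation(vis, lang, torch.zeros(2, requires_grad=True))
print(fused.shape)  # torch.Size([1, 25, 25])
```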
Citations: 18
A functional approach to rotation equivariant non-linearities for Tensor Field Networks
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01297
A. Poulenard, L. Guibas
Learning pose-invariant representations is a fundamental problem in shape analysis. Most existing deep learning algorithms for 3D shape analysis are not robust to rotations and are often trained on synthetic datasets consisting of pre-aligned shapes, yielding poor generalization to unseen poses. This observation motivates a growing interest in rotation-invariant and equivariant methods. The field of rotation equivariant deep learning has been developing in recent years thanks to a well-established theory of Lie group representations and convolutions. A fundamental problem in equivariant deep learning is to design activation functions which are both informative and preserve equivariance. The recently introduced Tensor Field Network (TFN) framework provides a rotation equivariant network design for point cloud analysis. TFN features undergo a rotation in feature space given a rotation of the input point cloud. TFN and similar designs consider nonlinearities which operate only over rotation-invariant features, such as the norm of equivariant features, to preserve equivariance, making them unable to capture directional information. In a recent work entitled "Gauge Equivariant Mesh CNNs: Anisotropic Convolutions on Geometric Graphs", Hann et al. interpret 2D rotation equivariant features as Fourier coefficients of functions on the circle. In this work we transpose the idea of Hann et al. to 3D by interpreting TFN features as spherical harmonics coefficients of functions on the sphere. We introduce a new equivariant nonlinearity and pooling for TFN. We show improvements over the original TFN design and other equivariant nonlinearities in classification and segmentation tasks. Furthermore, our method is competitive with state-of-the-art rotation-invariant methods in some instances.
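The functional trick can be sketched in a few lines (a hedged illustration limited to degree-1 real spherical harmonics, Monte Carlo sampling of the sphere, and a least-squares projection; the paper's construction is more general): evaluate the spherical function encoded by the coefficients, apply a pointwise nonlinearity on the sphere, and project back to coefficients.

```python
# Apply a pointwise nonlinearity to a function on the sphere given by SH coefficients.
import numpy as np

def real_sh_basis_deg1(dirs):
    """Real spherical harmonics up to degree 1 at unit directions dirs: (N, 3)."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    c0, c1 = 0.28209479, 0.48860251        # sqrt(1/(4*pi)), sqrt(3/(4*pi))
    return np.stack([np.full_like(x, c0), c1 * y, c1 * z, c1 * x], axis=1)  # (N, 4)

def functional_relu(coeffs, num_samples=512, seed=0):
    """ReLU applied in the function domain; returns new SH coefficients."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(num_samples, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    B = real_sh_basis_deg1(dirs)            # (N, 4) sampled basis
    values = B @ coeffs                     # evaluate the spherical function
    activated = np.maximum(values, 0.0)     # pointwise nonlinearity on the sphere
    new_coeffs, *_ = np.linalg.lstsq(B, activated, rcond=None)
    return new_coeffs

coeffs = np.array([0.5, -0.3, 0.8, 0.1])    # one (degree-0, degree-1) feature channel
print(functional_relu(coeffs))              # coefficients after the on-sphere ReLU
```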
Citations: 27
Iterative Filter Adaptive Network for Single Image Defocus Deblurring
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00207
Junyong Lee, Hyeongseok Son, Jaesung Rim, Sunghyun Cho, Seungyong Lee
We propose a novel end-to-end learning-based approach for single image defocus deblurring. The proposed approach is equipped with a novel Iterative Filter Adaptive Network (IFAN) that is specifically designed to handle spatially-varying and large defocus blur. For adaptively handling spatially-varying blur, IFAN predicts pixel-wise deblurring filters, which are applied to defocused features of an input image to generate deblurred features. For effectively managing large blur, IFAN models deblurring filters as stacks of small-sized separable filters. Predicted separable deblurring filters are applied to defocused features using a novel Iterative Adaptive Convolution (IAC) layer. We also propose a training scheme based on defocus disparity estimation and reblurring, which significantly boosts the deblurring quality. We demonstrate that our method achieves state-of-the-art performance both quantitatively and qualitatively on real-world images.
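A rough sketch of what one separable adaptive convolution step does is shown below (not the paper's IAC implementation; the unfold-based realization, softmax-normalized filters, and shapes are assumptions): a predicted per-pixel vertical 1D filter is applied first, then a per-pixel horizontal one.

```python
# Per-pixel separable adaptive filtering of a feature map.
import torch
import torch.nn.functional as F

def separable_adaptive_conv(feat, w_vert, w_horz):
    """feat: (N, C, H, W); w_vert, w_horz: (N, k, H, W) per-pixel 1D filter weights."""
    n, c, h, w = feat.shape
    k = w_vert.shape[1]
    # Vertical pass: gather a (k x 1) neighborhood per pixel and blend it.
    cols = F.unfold(feat, kernel_size=(k, 1), padding=(k // 2, 0))   # (N, C*k, H*W)
    cols = cols.view(n, c, k, h * w)
    feat = (cols * w_vert.view(n, 1, k, h * w)).sum(dim=2).view(n, c, h, w)
    # Horizontal pass with the second set of predicted weights.
    cols = F.unfold(feat, kernel_size=(1, k), padding=(0, k // 2))
    cols = cols.view(n, c, k, h * w)
    return (cols * w_horz.view(n, 1, k, h * w)).sum(dim=2).view(n, c, h, w)

n, c, h, w, k = 1, 8, 32, 32, 5
feat = torch.randn(n, c, h, w)
w_v = torch.softmax(torch.randn(n, k, h, w), dim=1)   # normalized per-pixel filters
w_h = torch.softmax(torch.randn(n, k, h, w), dim=1)
out = separable_adaptive_conv(feat, w_v, w_h)
print(out.shape)  # torch.Size([1, 8, 32, 32])
```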
Citations: 62
Physically-aware Generative Network for 3D Shape Modeling
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00921
Mariem Mezghanni, Malika Boulkenafed, A. Lieutier, M. Ovsjanikov
Shapes are often designed to satisfy structural properties and serve a particular functionality in the physical world. Unfortunately, most existing generative models focus primarily on geometric or visual plausibility, ignoring physical or structural constraints. To remedy this, we present a novel method aimed at endowing deep generative models with physical reasoning. In particular, we introduce a loss and a learning framework that promote two key characteristics of the generated shapes: their connectivity and physical stability. The former ensures that each generated shape consists of a single connected component, while the latter promotes the stability of that shape when subjected to gravity. Our proposed physical losses are fully differentiable, and we demonstrate their use in end-to-end learning. Crucially, we demonstrate that such physical objectives can be achieved without sacrificing the expressive power of the model or the variability of the generated results. Through extensive comparisons with state-of-the-art deep generative models, we demonstrate the utility and efficiency of our proposed approach, while avoiding the potentially costly differentiable physical simulation at training time.
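The two properties the losses encourage can be checked directly on a voxelized shape, as in the rough sketch below (a non-differentiable illustration with toy assumptions; the paper's losses are differentiable relaxations of these ideas): one connected component, and a center of mass lying over the ground-contact footprint.

```python
# Hard checks for connectivity and gravity stability on a boolean voxel grid.
import numpy as np
from scipy import ndimage

def is_connected(voxels):
    """True if the occupied voxels form a single connected component."""
    _, num_components = ndimage.label(voxels)
    return num_components == 1

def is_stable_under_gravity(voxels):
    """True if the center of mass projects inside the bounding box of the base layer."""
    occ = np.argwhere(voxels)                        # (M, 3) occupied (x, y, z) coordinates
    com_xy = occ[:, :2].mean(axis=0)                 # horizontal center of mass
    base = occ[occ[:, 2] == occ[:, 2].min()][:, :2]  # voxels touching the ground plane
    lo, hi = base.min(axis=0), base.max(axis=0)
    return bool(np.all(com_xy >= lo) and np.all(com_xy <= hi))

shape = np.zeros((8, 8, 8), dtype=bool)
shape[2:6, 2:6, 0:4] = True                          # a solid block resting on the ground
print(is_connected(shape), is_stable_under_gravity(shape))  # True True
```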
Citations: 12
SSLayout360: Semi-Supervised Indoor Layout Estimation from 360° Panorama
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01510
Phi Vu Tran
Recent years have seen flourishing research on both semi-supervised learning and 3D room layout reconstruction. In this work, we explore the intersection of these two fields to advance the research objective of enabling more accurate 3D indoor scene modeling with less labeled data. We propose the first approach to learn representations of room corners and boundaries by using a combination of labeled and unlabeled data for improved layout estimation in a 360° panoramic scene. Through extensive comparative experiments, we demonstrate that our approach can advance layout estimation of complex indoor scenes using as few as 20 labeled examples. When coupled with a layout predictor pre-trained on synthetic data, our semi-supervised method matches the fully supervised counterpart using only 12% of the labels. Our work takes an important first step towards robust semi-supervised layout estimation that can enable many applications in 3D perception with limited labeled data.
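The general recipe of combining a small labeled set with many unlabeled panoramas can be sketched as a teacher-student consistency step (a hedged, generic sketch: the linear stand-in network, the L1/MSE loss choices, and the EMA rate are assumptions, not the SSLayout360 architecture): a supervised loss on the labeled batch plus a consistency loss pulling student predictions toward a slowly updated teacher.

```python
# Generic semi-supervised training step with an EMA teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

def semi_supervised_step(student, teacher, x_lab, y_lab, x_unlab, opt, w_cons=1.0, ema=0.99):
    sup = F.l1_loss(student(x_lab), y_lab)            # supervised layout regression
    with torch.no_grad():
        target = teacher(x_unlab)                     # teacher targets on unlabeled data
    cons = F.mse_loss(student(x_unlab), target)       # consistency loss
    loss = sup + w_cons * cons
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                             # teacher = EMA of the student
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema).add_(p_s, alpha=1.0 - ema)
    return loss.item()

student = nn.Linear(512, 8)        # stand-in for the layout network (8 output values)
teacher = nn.Linear(512, 8)
teacher.load_state_dict(student.state_dict())
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss = semi_supervised_step(student, teacher,
                            torch.randn(4, 512), torch.randn(4, 8),   # labeled batch
                            torch.randn(16, 512), opt)                # unlabeled batch
print(loss)
```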
Citations: 7
Semantic-Aware Video Text Detection
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00174
Wei Feng, Fei Yin, Xu-Yao Zhang, Cheng-Lin Liu
Most existing video text detection methods track texts with appearance features, which are easily influenced by changes of perspective and illumination. Compared with appearance features, semantic features are more robust cues for matching text instances. In this paper, we propose an end-to-end trainable video text detector that tracks texts based on semantic features. First, we introduce a new character center segmentation branch to extract semantic features, which encode the category and position of characters. Then we propose a novel appearance-semantic-geometry descriptor to track text instances, in which semantic features can improve the robustness against appearance changes. To overcome the lack of character-level annotations, we propose a novel weakly-supervised character center detection module, which only uses word-level annotated real images to generate character-level labels. The proposed method achieves state-of-the-art performance on three video text benchmarks (ICDAR 2013 Video, Minetto, and RT-1K) and two Chinese scene text benchmarks (CASIA10K and MSRA-TD500).
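A small sketch of how a character-center heatmap can be turned into discrete character positions follows (illustrative only, not the paper's decoding; the threshold and blob-centroid scheme are assumptions): threshold the map, label connected blobs, and take each blob's centroid as a character center.

```python
# Decode character centers from a predicted center heatmap.
import numpy as np
from scipy import ndimage

def extract_character_centers(heatmap, thresh=0.5):
    mask = heatmap > thresh
    labels, num = ndimage.label(mask)
    centers = ndimage.center_of_mass(heatmap, labels, index=list(range(1, num + 1)))
    return np.array(centers)           # (num_chars, 2) centers in (row, col) order

# Toy heatmap with two character centers.
heat = np.zeros((10, 20))
heat[2, 3] = heat[7, 15] = 1.0
heat = ndimage.gaussian_filter(heat, sigma=1.0)
heat /= heat.max()
print(extract_character_centers(heat, thresh=0.4))   # approx. [[2, 3], [7, 15]]
```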
Citations: 17
Self-generated Defocus Blur Detection via Dual Adversarial Discriminators
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00686
Wenda Zhao, Cai Shang, Huchuan Lu
Although existing fully-supervised defocus blur detection (DBD) models significantly improve performance, training such deep models requires abundant pixel-level manual annotation, which is highly time-consuming and error-prone. Addressing this issue, this paper makes an effort to train a deep DBD model without using any pixel-level annotation. The core insight is that a defocus blur region/focused clear area can be arbitrarily pasted to a given realistic full blurred image/full clear image without affecting the judgment of the full blurred image/full clear image. Specifically, we train a generator G in an adversarial manner against dual discriminators Dc and Db. G learns to produce a DBD mask that generates a composite clear image and a composite blurred image by copying the focused area and unfocused region from the corresponding source image to another full clear image and full blurred image. Then, Dc and Db cannot distinguish them from a realistic full clear image and full blurred image simultaneously, achieving self-generated DBD by implicitly defining what a defocus blur area is. Besides, we propose a bilateral triplet-excavating constraint to avoid the degenerate case in which one discriminator defeats the other. Comprehensive experiments on two widely-used DBD datasets demonstrate the superiority of the proposed approach. Source code is available at: https://github.com/shangcai1/SG.
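The copy-paste insight can be written down directly, as in the hedged sketch below (the mask, images, and shapes are toy assumptions): the predicted mask pastes the focused part of the source into a fully clear image and the blurred part into a fully blurred image, and the adversarial training would then ask the two composites to remain indistinguishable from real full-clear and full-blurred images.

```python
# Build the composite clear / composite blurred images from a predicted DBD mask.
import numpy as np

def make_composites(source, mask, full_clear, full_blur):
    """mask: (H, W) in [0, 1], 1 where the generator marks defocus blur in `source`."""
    m = mask[..., None]                                   # broadcast over color channels
    composite_clear = (1.0 - m) * source + m * full_clear  # keep the focused part of source
    composite_blur = m * source + (1.0 - m) * full_blur    # keep the blurred part of source
    return composite_clear, composite_blur

h, w = 64, 64
rng = np.random.default_rng(2)
source = rng.random((h, w, 3))
full_clear = rng.random((h, w, 3))
full_blur = rng.random((h, w, 3))
mask = (np.arange(w) > w // 2).astype(float) * np.ones((h, 1))   # toy half-image mask
clear_c, blur_c = make_composites(source, mask, full_clear, full_blur)
print(clear_c.shape, blur_c.shape)   # (64, 64, 3) (64, 64, 3)
```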
Citations: 13
Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01256
Yasamin Jafarian, H. Park
A key challenge of learning the geometry of dressed humans lies in the limited availability of ground truth data (e.g., 3D scanned models), which results in performance degradation of 3D human reconstruction when applied to real-world imagery. We address this challenge by leveraging a new data resource: a number of social media dance videos that span diverse appearances, clothing styles, performances, and identities. Each video depicts dynamic movements of the body and clothes of a single person while lacking 3D ground truth geometry. To utilize these videos, we present a new method that uses a local transformation to warp the predicted local geometry of the person from one image to that of another image at a different time instant. This allows self-supervision by enforcing temporal coherence over the predictions. In addition, we jointly learn the depth along with the surface normals, which are highly responsive to local texture, wrinkle, and shade, by maximizing their geometric consistency. Our method is end-to-end trainable, resulting in high fidelity depth estimation that predicts fine geometry faithful to the input real image. We demonstrate that our method outperforms the state-of-the-art human depth estimation and human shape recovery approaches on both real and rendered images.
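The temporal self-supervision can be illustrated with a hedged sketch (the dense correspondence grid is assumed to be given, e.g., by the paper's local warping; the identity grid, shapes, and L1 penalty here are toy assumptions): depth predicted in one frame is sampled at corresponding locations and compared against the prediction in the other frame.

```python
# Temporal depth-consistency penalty between two frames linked by correspondences.
import torch
import torch.nn.functional as F

def temporal_depth_consistency(depth_t, depth_s, corr_grid):
    """depth_t, depth_s: (N, 1, H, W); corr_grid: (N, H, W, 2) normalized coords in [-1, 1]."""
    warped = F.grid_sample(depth_s, corr_grid, align_corners=True)  # sample source depth
    return F.l1_loss(depth_t, warped)                               # penalize disagreement

n, h, w = 1, 24, 24
depth_t = torch.rand(n, 1, h, w) + 1.0
depth_s = depth_t.clone()                      # perfectly consistent toy frames
ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
identity_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # identity correspondences
print(temporal_depth_consistency(depth_t, depth_s, identity_grid))  # ~0
```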
Citations: 41