
Latest publications: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

FM2u-Net: Face Morphological Multi-Branch Network for Makeup-Invariant Face Verification
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.00577
Wenxuan Wang, Yanwei Fu, Xuelin Qian, Yu-Gang Jiang, Qi Tian, X. Xue
Learning a makeup-invariant face verification model is challenging due to (1) insufficient makeup/non-makeup face training pairs, (2) the lack of diverse makeup faces, and (3) the significant appearance changes caused by cosmetics. To address these challenges, we propose a unified Face Morphological Multi-branch Network (FMMu-Net) for makeup-invariant face verification, which can simultaneously synthesize many diverse makeup faces through a face morphology network (FM-Net) and effectively learn cosmetics-robust face representations using an attention-based multi-branch learning network (AttM-Net). For challenges (1) and (2), FM-Net (two stacked auto-encoders) can synthesize realistic makeup face images by transferring specific regions of cosmetics via a cycle-consistent loss. For challenge (3), AttM-Net, consisting of one global and three local (task-driven on the two eyes and the mouth) branches, can effectively capture complementary holistic and detailed information. Unlike DeepID2, which uses simple concatenation fusion, we introduce a heuristic method, AttM-FM, attached to AttM-Net, to adaptively weight the features of the different branches guided by the holistic information. We conduct extensive experiments on makeup face verification benchmarks (M-501, M-203, and FAM) and general face recognition datasets (LFW and IJB-A). Our framework FMMu-Net achieves state-of-the-art performance.
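To make the fusion step concrete, below is a minimal PyTorch sketch of weighting several branch features with weights predicted from the holistic (global) feature. The module name, feature dimensions, and the softmax gating are illustrative assumptions, not the paper's AttM-FM implementation:

import torch
import torch.nn as nn

class AttentiveBranchFusion(nn.Module):
    def __init__(self, feat_dim=256, num_branches=4):
        super().__init__()
        # The holistic (global) feature decides how much each branch contributes.
        self.weight_net = nn.Sequential(
            nn.Linear(feat_dim, num_branches),
            nn.Softmax(dim=-1),
        )

    def forward(self, branch_feats):
        # branch_feats: list of (B, feat_dim) tensors; branch 0 is the global one.
        stacked = torch.stack(branch_feats, dim=1)             # (B, num_branches, feat_dim)
        weights = self.weight_net(branch_feats[0])             # (B, num_branches)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)   # weighted sum -> (B, feat_dim)
        return fused

# Usage: four 256-d branch embeddings (global, left eye, right eye, mouth).
feats = [torch.randn(8, 256) for _ in range(4)]
fused = AttentiveBranchFusion()(feats)    # -> (8, 256)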
Citations: 11
Recognizing Objects From Any View With Object and Viewer-Centered Representations
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.01180
Sainan Liu, Vincent Nguyen, Isaac Rehg, Z. Tu
In this paper, we tackle an important task in computer vision: any-view object recognition. In both training and testing, for each object instance, we are only given its 2D image viewed from an unknown angle. We propose a computational framework by designing object and viewer-centered neural networks (OVCNet) to recognize an object instance viewed from an arbitrary unknown angle. OVCNet consists of three branches that respectively implement object-centered, 3D viewer-centered, and in-plane viewer-centered recognition. We evaluate our proposed OVCNet using two metrics with unseen views from both seen and novel object instances. Experimental results demonstrate the advantages of OVCNet over classic 2D-image-based CNN classifiers, 3D-object (inferred from 2D image) classifiers, and competing multi-view-based approaches. It gives rise to a viable and practical computing framework that combines both viewpoint-dependent and viewpoint-independent features for object recognition from any view.
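A minimal PyTorch sketch of the three-branch structure described above, with the branch outputs fused by simply averaging class logits; the backbone, head shapes, and the averaging rule are assumptions for illustration, not OVCNet's actual design:

import torch
import torch.nn as nn

class ThreeBranchRecognizer(nn.Module):
    def __init__(self, feat_dim=512, num_classes=40):
        super().__init__()
        # Toy shared backbone over a 3x64x64 input; a real model would use a CNN.
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
        self.object_centered = nn.Linear(feat_dim, num_classes)
        self.viewer_centered_3d = nn.Linear(feat_dim, num_classes)
        self.in_plane_viewer = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        f = self.backbone(x)
        logits = [self.object_centered(f), self.viewer_centered_3d(f), self.in_plane_viewer(f)]
        return torch.stack(logits, dim=0).mean(dim=0)   # average the three branch predictions

model = ThreeBranchRecognizer()
scores = model(torch.randn(2, 3, 64, 64))    # -> (2, 40)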
Citations: 3
Structure Aware Single-Stage 3D Object Detection From Point Cloud
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.01189
Chenhang He, Huiyu Zeng, Jianqiang Huang, Xiansheng Hua, Lei Zhang
3D object detection from point cloud data plays an essential role in autonomous driving. Current single-stage detectors achieve efficiency by progressively downscaling the 3D point clouds in a fully convolutional manner. However, the downscaled features inevitably lose spatial information and cannot make full use of the structure information of the 3D point cloud, degrading their localization precision. In this work, we propose to improve the localization precision of single-stage detectors by explicitly leveraging the structure information of the 3D point cloud. Specifically, we design an auxiliary network which converts the convolutional features in the backbone network back to point-level representations. The auxiliary network is jointly optimized, by two point-level supervisions, to guide the convolutional features in the backbone network to be aware of the object structure. The auxiliary network can be detached after training and therefore introduces no extra computation in the inference stage. Besides, considering that single-stage detectors suffer from the discordance between the predicted bounding boxes and the corresponding classification confidences, we develop an efficient part-sensitive warping operation to align the confidences with the predicted bounding boxes. Our proposed detector ranks at the top of the KITTI 3D/BEV detection leaderboards and runs at 25 FPS for inference.
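A minimal PyTorch sketch of the train-only auxiliary head idea: the auxiliary output is produced only in training mode, so inference pays no extra cost. The layer shapes and head outputs are illustrative assumptions, not the paper's implementation:

import torch
import torch.nn as nn

class DetectorWithAuxHead(nn.Module):
    def __init__(self, in_ch=64):
        super().__init__()
        self.backbone = nn.Conv2d(in_ch, 128, 3, padding=1)
        self.det_head = nn.Conv2d(128, 7, 1)    # e.g. box regression outputs
        self.aux_head = nn.Conv2d(128, 1, 1)    # e.g. a point-level foreground score for extra supervision

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        boxes = self.det_head(feat)
        if self.training:                        # auxiliary branch only runs during training
            return boxes, self.aux_head(feat)
        return boxes                             # detached at inference: no extra computation

model = DetectorWithAuxHead()
model.train()
out_train = model(torch.randn(1, 64, 32, 32))   # (boxes, aux)
model.eval()
out_test = model(torch.randn(1, 64, 32, 32))    # boxes only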
Citations: 350
Label Decoupling Framework for Salient Object Detection
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.01304
Junhang Wei, Shuhui Wang, Zhe Wu, Chi Su, Qingming Huang, Q. Tian
To get more accurate saliency maps, recent methods mainly focus on aggregating multi-level features from a fully convolutional network (FCN) and introducing edge information as auxiliary supervision. Though remarkable progress has been achieved, we observe that the closer a pixel is to the edge, the more difficult it is to predict, because edge pixels have a very imbalanced distribution. To address this problem, we propose a label decoupling framework (LDF) which consists of a label decoupling (LD) procedure and a feature interaction network (FIN). LD explicitly decomposes the original saliency map into a body map and a detail map, where the body map concentrates on the center areas of objects and the detail map focuses on regions around edges. The detail map works better because it involves many more pixels than traditional edge supervision. Different from the saliency map, the body map discards edge pixels and only pays attention to center areas. This successfully avoids the distraction from edge pixels during training. Therefore, we employ two branches in FIN to deal with the body map and the detail map respectively. Feature interaction (FI) is designed to fuse the two complementary branches to predict the saliency map, which is then used to refine the two branches again. This iterative refinement is helpful for learning better representations and more precise saliency maps. Comprehensive experiments on six benchmark datasets demonstrate that LDF outperforms state-of-the-art approaches on different evaluation metrics.
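A minimal NumPy/SciPy sketch of decoupling a binary ground-truth mask into a body map (emphasizing pixels far from the edge) and a complementary detail map; the distance-transform normalization shown here is an assumption and may differ from LDF's exact procedure:

import numpy as np
from scipy.ndimage import distance_transform_edt

def decouple_label(mask):
    # mask: (H, W) binary ground-truth saliency map with values in {0, 1}.
    dist = distance_transform_edt(mask)    # distance of each foreground pixel to the nearest background pixel
    if dist.max() > 0:
        dist = dist / dist.max()           # normalize to [0, 1]
    body = mask * dist                     # high in object centers, low near edges
    detail = mask - body                   # complementary map concentrated around edges
    return body, detail

mask = np.zeros((64, 64))
mask[16:48, 16:48] = 1.0
body, detail = decouple_label(mask)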
Citations: 181
CookGAN: Causality Based Text-to-Image Synthesis
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.00556
B. Zhu, C. Ngo
This paper addresses the problem of text-to-image synthesis from a new perspective, i.e., the cause-and-effect chain in image generation. Causality is a common phenomenon in cooking: the dish appearance changes depending on the cooking actions and ingredients. The challenge of synthesis is that a generated image should depict the visual result of action-on-object. This paper presents a new network architecture, CookGAN, that mimics the visual effect in the causality chain, preserves fine-grained details, and progressively upsamples the image. In particular, a cooking simulator sub-network is proposed to incrementally make changes to food images based on the interaction between ingredients and cooking methods over a series of steps. Experiments on Recipe1M verify that CookGAN manages to generate food images with a reasonably impressive inception score. Furthermore, the images are semantically interpretable and manipulable.
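A minimal PyTorch sketch of the step-wise idea: an image latent is updated once per instruction step, conditioned on a per-step embedding. The recurrent cell and dimensions are assumptions for illustration, not CookGAN's cooking-simulator architecture:

import torch
import torch.nn as nn

class StepwiseSimulator(nn.Module):
    def __init__(self, latent_dim=128, step_dim=64):
        super().__init__()
        self.cell = nn.GRUCell(step_dim, latent_dim)    # one update per cooking step

    def forward(self, latent, step_embeddings):
        # latent: (B, latent_dim); step_embeddings: (B, T, step_dim), one embedding per step.
        for t in range(step_embeddings.size(1)):
            latent = self.cell(step_embeddings[:, t], latent)
        return latent                                    # latent after all steps, to be fed to a decoder

sim = StepwiseSimulator()
z = sim(torch.randn(4, 128), torch.randn(4, 5, 64))     # -> (4, 128)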
Citations: 46
Learning Fused Pixel and Feature-Based View Reconstructions for Light Fields
Pub Date : 2020-06-01 DOI: 10.1109/CVPR42600.2020.00263
Jinglei Shi, Xiaoran Jiang, C. Guillemot
In this paper, we present a learning-based framework for light field view synthesis from a subset of input views. Building upon a light-weight optical flow estimation network to obtain depth maps, our method employs two reconstruction modules, in the pixel and feature domains respectively. For the pixel-wise reconstruction, occlusions are explicitly handled by a disparity-dependent interpolation filter, whereas inpainting of disoccluded areas is learned by convolutional layers. Due to disparity inconsistencies, the pixel-based reconstruction may lead to blurriness in highly textured areas as well as on object contours. In contrast, the feature-based reconstruction performs well on high frequencies, making the reconstructions in the two domains complementary. End-to-end learning is finally performed, including a fusion module that merges the pixel- and feature-based reconstructions. Experimental results show that our method achieves state-of-the-art performance on both synthetic and real-world datasets; moreover, it is even able to extend light fields' baseline by extrapolating high-quality views without additional training.
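A minimal PyTorch sketch of the pixel-domain part: warping a source view with a per-pixel displacement field (standing in for the disparity-based shift) via grid sampling. The paper's disparity-dependent interpolation filter and occlusion handling are omitted:

import torch
import torch.nn.functional as F

def warp(src, flow):
    # src: (B, C, H, W) source view; flow: (B, 2, H, W) per-pixel displacement (dx, dy) in pixels.
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)    # (1, 2, H, W), (x, y) order
    coords = base + flow                                        # absolute sampling positions
    grid = torch.stack((2.0 * coords[:, 0] / (w - 1) - 1.0,     # normalize x to [-1, 1]
                        2.0 * coords[:, 1] / (h - 1) - 1.0),    # normalize y to [-1, 1]
                       dim=-1)                                  # (B, H, W, 2) as grid_sample expects
    return F.grid_sample(src, grid, align_corners=True)

# Zero flow reproduces the source view; a horizontal disparity field would shift it.
warped = warp(torch.randn(1, 3, 32, 32), torch.zeros(1, 2, 32, 32))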
Citations: 40
Retina-Like Visual Image Reconstruction via Spiking Neural Model
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.00151
Lin Zhu, Siwei Dong, Jianing Li, Tiejun Huang, Yonghong Tian
The high-sensitivity vision of primates, including humans, is mediated by a small retinal region called the fovea. As a novel bio-inspired vision sensor, the spike camera mimics the fovea to record natural scenes with continuous-time spikes instead of in a frame-based manner. However, reconstructing visual images from the spikes remains a challenge. In this paper, we design a retina-like visual image reconstruction framework, which is flexible in reconstructing the full texture of natural scenes from the totally new spike data. Specifically, the proposed architecture consists of a motion local excitation layer, a spike refining layer, and a visual reconstruction layer, motivated by bio-realistic leaky integrate-and-fire (LIF) neurons and synapse connections with spike-timing-dependent plasticity (STDP) rules. This approach may represent a major shift from conventional frame-based vision to continuous-time retina-like vision, owing to the advantages of high temporal resolution and low power consumption. To test the performance, a spike dataset recorded by the spike camera is constructed. The experimental results show that the proposed approach is extremely effective in reconstructing the visual image in both normal and high-speed scenes, while achieving high dynamic range and high image quality.
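A minimal NumPy sketch of a generic leaky integrate-and-fire (LIF) update, the neuron model the reconstruction layers are motivated by; the leak and threshold constants, and the reset-to-zero rule, are illustrative and not the paper's exact parameters:

import numpy as np

def lif(current, leak=0.9, threshold=1.0):
    # current: 1-D array of input at each time step; returns a binary spike train of the same length.
    v, spikes = 0.0, []
    for i_t in current:
        v = leak * v + i_t        # leaky integration of the membrane potential
        if v >= threshold:        # fire once the threshold is crossed ...
            spikes.append(1)
            v = 0.0               # ... then reset the potential
        else:
            spikes.append(0)
    return np.array(spikes)

spike_train = lif(np.random.rand(100) * 0.3)    # toy input current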
Citations: 49
OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold
Pub Date : 2020-06-01 DOI: 10.1109/CVPR42600.2020.01472
Mohamed Yousef, Tom E. Bishop
Text recognition is a major computer vision task with a large set of associated challenges. One of those traditional challenges is the coupled nature of text recognition and segmentation. This problem has been progressively solved over the past decades, going from segmentation-based recognition to segmentation-free approaches, which proved more accurate and much cheaper to annotate data for. We take a step from segmentation-free single-line recognition towards segmentation-free multi-line / full-page recognition. We propose a novel and simple neural network module, termed OrigamiNet, that can augment any CTC-trained, fully convolutional single-line text recognizer and convert it into a multi-line version by providing the model with enough spatial capacity to properly collapse a 2D input signal into 1D without losing information. Such modified networks can be trained using exactly the same simple original procedure, and using only unsegmented image and text pairs. We carry out a set of interpretability experiments which show that our trained models learn an accurate implicit line segmentation. We achieve state-of-the-art character error rates on both the IAM and ICDAR 2017 HTR benchmarks for handwriting recognition, surpassing all other methods in the literature. On IAM we even surpass single-line methods that use accurate localization information during training. Our code is available online at https://github.com/IntuitionMachines/OrigamiNet .
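A minimal PyTorch sketch of the collapse idea: a 2D class-score map is reshaped into one long 1D sequence so a standard CTC loss applies to multi-line input. The plain row-by-row reshape is an illustrative stand-in for OrigamiNet's learned unfolding, and all sizes are assumptions:

import torch
import torch.nn as nn

feat = torch.randn(2, 80, 8, 100)                        # (B, num_classes, H, W) from a conv recognizer
b, c, h, w = feat.shape
seq = feat.permute(0, 2, 3, 1).reshape(b, h * w, c)      # rows laid end to end -> (B, H*W, C)
log_probs = seq.log_softmax(-1).permute(1, 0, 2)         # (T, B, C) as expected by nn.CTCLoss

ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 80, (2, 50))                  # dummy label sequences
loss = ctc(log_probs,
           targets,
           input_lengths=torch.full((2,), h * w, dtype=torch.long),
           target_lengths=torch.full((2,), 50, dtype=torch.long))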
Citations: 59
Hierarchical Pyramid Diverse Attention Networks for Face Recognition
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.00835
Qiangchang Wang, Tianyi Wu, He Zheng, G. Guo
Deep learning has achieved great success in face recognition (FR); however, few existing models take hierarchical multi-scale local features into consideration. In this work, we propose a hierarchical pyramid diverse attention (HPDA) network. First, it is observed that local patches play important roles in FR when the global face appearance changes dramatically. Some recent works apply attention modules to locate local patches automatically without relying on face landmarks. Unfortunately, without considering diversity, some learned attentions tend to have redundant responses around similar local patches, while neglecting other potentially discriminative facial parts. Meanwhile, local patches may appear at different scales due to pose variations or large expression changes. To alleviate these challenges, we propose a pyramid diverse attention (PDA) module to learn multi-scale diverse local representations automatically and adaptively. More specifically, a pyramid attention is developed to capture multi-scale features. Meanwhile, a diverse learning scheme is developed to encourage the model to focus on different local patches and generate diverse local features. Second, almost all existing models focus on extracting features from the last convolutional layer, lacking the local details or small-scale face parts present in lower layers. Instead of simple concatenation or addition, we propose to use hierarchical bilinear pooling (HBP) to fuse information from multiple layers effectively. Thus, the HPDA is developed by integrating the PDA into the HBP. Experimental results on several datasets show the effectiveness of the HPDA compared to state-of-the-art methods.
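A minimal PyTorch sketch of hierarchical bilinear pooling over features from several layers: project each layer, take element-wise products of layer pairs, pool spatially, and concatenate. Channel sizes and the pooling choice are illustrative assumptions, not the paper's exact HBP:

import torch
import torch.nn as nn
from itertools import combinations

class HierarchicalBilinearPooling(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), embed_dim=512):
        super().__init__()
        # 1x1 convolutions project every layer to a common embedding dimension.
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, 1) for c in in_channels])

    def forward(self, feats):
        # feats: list of (B, C_i, H, W) maps from different layers (same spatial size).
        projected = [p(f) for p, f in zip(self.proj, feats)]
        fused = [(a * b).mean(dim=(2, 3))               # bilinear interaction + global average pooling
                 for a, b in combinations(projected, 2)]
        return torch.cat(fused, dim=1)                  # (B, embed_dim * num_pairs)

hbp = HierarchicalBilinearPooling()
out = hbp([torch.randn(2, 128, 7, 7), torch.randn(2, 256, 7, 7), torch.randn(2, 512, 7, 7)])  # -> (2, 1536)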
Citations: 51
Warp to the Future: Joint Forecasting of Features and Feature Motion
Pub Date : 2020-06-01 DOI: 10.1109/CVPR42600.2020.01066
Josip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa Segvic
We address anticipation of scene development by forecasting the semantic segmentation of future frames. Several previous works approach this problem by F2F (feature-to-feature) forecasting, where future features are regressed from observed features. Different from previous work, we consider a novel F2M (feature-to-motion) formulation, which performs the forecast by warping observed features according to regressed feature flow. This formulation models a causal relationship between the past and the future, and regularizes inference by reducing the dimensionality of the forecasting target. However, the emergence of future scenery that was not visible in the observed frames cannot be explained by warping. We propose to address this issue by complementing F2M forecasting with the classic F2F approach. We realize this idea as a multi-head F2MF model built atop shared features. Experiments show that the F2M head prevails in static parts of the scene while the F2F head kicks in to fill in the novel regions. The proposed F2MF model operates in synergy with correlation features and outperforms all previous approaches in both short-term and mid-term forecasting on the Cityscapes dataset.
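A minimal PyTorch sketch of the multi-head blend: an F2F head regresses future features directly, an F2M head warps observed features along a regressed feature flow, and a per-pixel gate mixes the two. It reuses the same grid-sampling pattern as the light-field example above; all layer sizes and the sigmoid gate are assumptions, not the F2MF model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class F2MFBlend(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.f2f_head = nn.Conv2d(ch, ch, 3, padding=1)    # directly regress future features (F2F)
        self.flow_head = nn.Conv2d(ch, 2, 3, padding=1)    # regress feature flow for warping (F2M)
        self.gate_head = nn.Conv2d(ch, 1, 3, padding=1)    # per-pixel blend weight between the heads

    def forward(self, feat):
        # feat: (B, ch, H, W) features of the last observed frame.
        b, _, h, w = feat.shape
        flow = self.flow_head(feat)                                       # (B, 2, H, W), in feature-map pixels
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().to(feat)              # (2, H, W), (x, y) order
        coords = base.unsqueeze(0) + flow                                 # absolute sampling positions
        grid = torch.stack((2.0 * coords[:, 0] / (w - 1) - 1.0,
                            2.0 * coords[:, 1] / (h - 1) - 1.0), dim=-1)  # normalized for grid_sample
        f2m = F.grid_sample(feat, grid, align_corners=True)               # observed features warped forward
        f2f = self.f2f_head(feat)                                         # directly regressed future features
        gate = torch.sigmoid(self.gate_head(feat))
        return gate * f2m + (1.0 - gate) * f2f

future_feat = F2MFBlend()(torch.randn(1, 128, 16, 32))    # -> (1, 128, 16, 32)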
Citations: 21