Hand segmentation is one of the most fundamental and crucial steps for egocentric human-computer interaction. The egocentric viewpoint brings new challenges to the hand segmentation task, such as unpredictable environmental conditions. The performance of traditional hand segmentation methods depends on abundant manually labeled training data. However, these approaches do not fully capture the properties of egocentric human-computer interaction because they neglect the user-specific context: only a personalized hand model of the active user is needed. Based on this observation, we propose an online-learning hand segmentation approach that requires no manually labeled training data. Our approach consists of top-down classifications and bottom-up optimizations. More specifically, we divide the segmentation task into three parts: a frame-level hand detector, which detects the presence of the interacting hand using motion saliency and initializes hand masks for online learning; a superpixel-level hand classifier, which coarsely segments hand regions and from which stable samples are selected for the next level; and a pixel-level hand classifier, which produces a fine-grained hand segmentation. Based on the pixel-level classification result, we update the hand appearance model and optimize the upper-layer classifier and detector. This online-learning strategy makes our approach robust to varying illumination conditions and hand appearances. Experimental results demonstrate the robustness of our approach.
{"title":"Unsupervised Online Learning for Fine-Grained Hand Segmentation in Egocentric Video","authors":"Ying Zhao, Zhiwei Luo, Changqin Quan","doi":"10.1109/CRV.2017.17","DOIUrl":"https://doi.org/10.1109/CRV.2017.17","url":null,"abstract":"Hand segmentation is one of the most fundamental and crucial steps for egocentric human-computer interaction. The special egocentric view brings new challenges to hand segmentation task, such as the unpredictable environmental conditions. The performance of traditional hand segmentation methods depend on abundant manually labeled training data. However, these approaches do not appropriately capture the whole properties of egocentric human-computer interaction for neglecting the user-specific context. It is only necessary to build a personalized hand model of the active user. Based on this observation, we propose an online-learning hand segmentation approach without using manually labeled data for training. Our approach consists of top-down classifications and bottom-up optimizations. More specifically, we divide the segmentation task into three parts, a frame-level hand detection which detects the presence of the interactive hand using motion saliency and initializes hand masks for online learning, a superpixel-level hand classification which coarsely segments hand regions from which stable samples are selected for next level, and a pixel-level hand classification which produces a fine-grained hand segmentation. Based on the pixel-level classification result, we update the hand appearance model and optimize the upper layer classifier and detector. This online-learning strategy makes our approach robust to varying illumination conditions and hand appearances. Experimental results demonstrate the robustness of our approach.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122240783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual Odometry (VO) is a key enabling technology for mobile robotic systems that provides a relative motion estimate from a sequence of camera images. Cameras are comparatively inexpensive sensors that provide large amounts of useful data, making them one of the most common sensors in mobile robotics. However, because they are passive, they depend on external lighting, which can restrict their usefulness. Using headlights as an alternative lighting source, this paper investigates outdoor stereo VO performance under all lighting conditions during nearly 10 km of driving over 30 hours. Challenges include limited visibility range, a dynamic light source, and intensity hotspots, among others. A further significant issue is blooming and lens flare at dawn and dusk, when the camera looks directly into the sun. In our experiments, nighttime driving with headlights shows a moderately increased error of 2.38% over 250 m, compared with a daytime error of 1.5%. To the best of our knowledge, this is the first quantitative study of VO performance at night using headlights.
{"title":"Night Rider: Visual Odometry Using Headlights","authors":"K. MacTavish, M. Paton, T. Barfoot","doi":"10.1109/CRV.2017.48","DOIUrl":"https://doi.org/10.1109/CRV.2017.48","url":null,"abstract":"Visual Odometry (VO) is a key enabling technology for mobile robotic systems that provides a relative motion estimate from a sequence of camera images. Cameras are comparatively inexpensive sensors, and provide large amounts of useful data, making them one of the most common sensors in mobile robotics. However, because they are passive, they are dependent on external lighting, which can restrict their usefulness. Using headlights as an alternate lighting source, this paper investigates outdoor stereo VO performance under all lighting conditions during nearly 10 km of driving over 30 hours. Challenges include limited visibility range, a dynamic light source, intensity hotspots, and others. Another large issue comes from blooming and lens flare at dawn and dusk, when the camera is looking directly into the sun. In our experiments, nighttime driving with headlights has a moderately increased error of 2.38% over 250 m compared to the daytime error of 1.5%. To the best of our knowledge this is the first quantitative study of VO performance at night using headlights.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129839645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The task of transferring human knowledge and capabilities to robots is still an open problem. In this paper, we address the problem of transferring human grasping locations on a particular object to a robot manipulator. Using an RGBD sensor, we propose a computer-vision-based method for human hand detection. The method performs pixelwise hand detection in the color channel with a Random Forest classifier and kernel-based hand detection in the depth channel, and fuses the color and depth cues based on their joint probability. As a result, it is able to cope with noisy backgrounds and occlusion. We further apply the method to a grasping task example: in our test, the robot acquires grasping knowledge from visual observation. We evaluate the method on four sequences of varying difficulty, where it achieves high hand detection accuracy compared with RGB-only and depth-only methods.
{"title":"Towards Transferring Grasping from Human to Robot with RGBD Hand Detection","authors":"Rong Feng, Camilo Perez, Hong Zhang","doi":"10.1109/CRV.2017.45","DOIUrl":"https://doi.org/10.1109/CRV.2017.45","url":null,"abstract":"The task of transferring human knowledge and capabilities to robots is still an open problem. In this paper, we address the problem of transferring human grasping locations of a particular object to a robot manipulator. Using an RGBD sensor, we propose a computer vision based method for human hand detection. This method implements a pixelwise hand detection method with the Random Forest classification algorithm in the color channel. It also creates a kernel-based hand detection method in the depth channel. Based on the theory of joint probability, it fuses both color and depth cues. As a result, this method is able to deal with noisy background and occlusion. Moreover, we apply this method to a grasping task example. In our test, the robot is able to gain the grasping knowledge from visual observation. Our method is complemented with experimental results on the settings of four different sequences with different level of difficulties, and has achieved high performance with respect to hand detection accuracy in comparison with RGB and Depth only methods.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":" 15","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113947999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a bimodal biometric recognition system based on iris and palmprint. Different wavelet-based filters, including log Gabor, the Discrete Cosine Transform (DCT), Walsh, and Haar, are used to extract features from the images. We then fuse iris and palmprint at the feature level by concatenating the feature vectors of the two modalities. Since wavelet transforms generate a huge number of features, a dimensionality reduction step is necessary to make classification and matching tractable and computationally feasible. Two well-known dimensionality reduction algorithms, Laplacian eigenmaps and Singular Value Decomposition (SVD), are used to reduce the size of the feature space. Applying these methods not only decreases the computational cost of matching remarkably but also improves recognition accuracy by reducing unnecessary model complexity. Finally, multiple classification techniques are applied in the transformed feature spaces for matching and recognition. The CASIA iris and palmprint datasets are used in this study. The experiments show the effectiveness of our feature-level fusion and of the dimensionality reduction methods: the multimodal biometric system consistently outperforms the unimodal recognition systems, an appropriate dimensionality reduction algorithm consistently improves classifier accuracy, and the log Gabor filter extracts the most discriminative features compared with the other wavelet transforms.
{"title":"Manifold Learning of Overcomplete Feature Spaces in a Multimodal Biometric Recognition System of Iris and Palmprint","authors":"Habibeh Naderi, Behrouz Haji Soleimani, S. Matwin","doi":"10.1109/CRV.2017.29","DOIUrl":"https://doi.org/10.1109/CRV.2017.29","url":null,"abstract":"This paper presents a bimodal biometric recognition system based on iris and palmprint. Different wavelet-based filters including log Gabor, Discrete Cosine Transform (DCT), Walsh and Haar are used to extract features from images. Then we fuse iris and palmprint at the feature level by concatenating the feature vectors from two modalities. Since wavelet transforms generate huge number of features, a dimensionality reduction step is necessary to make the classification and matching steps tractable and computationally feasible. In this paper, two well-known dimensionality reduction algorithms including Laplacian eigenmaps and Singular Value Decomposition (SVD) are used to reduce the size of feature space. Applying these dimensionality reduction methods not only decreases the computational cost of matching remarkably but also it improves the accuracy of recognition by reducing the unnecessary model complexity. Eventually multiple classification techniques are used in the transformed feature spaces for the final matching and recognition. CASIA datasets for iris and palmprint are used in this study. The experiments show the effectiveness of our feature level fusion method and also the dimensionality reduction methods we used. Based on our experiments, our multimodal biometric system always outperforms the unimodal recognition systems with higher accuracy. Moreover, an appropriate dimensionality reduction algorithm always helps to improve the accuracy of classifier. Finally, the log Gabor filter extracts the most discriminative features from images compared to other wavelet transforms.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123741333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust visual place recognition (VPR) requires scene representations that are invariant to environmental challenges such as seasonal changes and variations in ambient lighting between day and night. Moreover, a practical VPR system requires compact representations of environmental features. To satisfy these requirements, we propose a modification to the existing VPR pipeline that incorporates supervised hashing. The modified system learns, in a supervised setting, compact binary codes from image feature descriptors. These binary codes absorb robustness to the visual variations seen during the training phase, making the system adaptive to severe environmental changes. Incorporating supervised hashing also makes VPR computationally more efficient and easy to implement on simple hardware, because the binary embeddings can be learned over simple-to-compute features and distances are computed in the low-dimensional Hamming space of binary codes. We perform experiments on several challenging datasets covering seasonal, illumination, and viewpoint variations. We also compare two widely used supervised hashing methods, CCAITQ [1] and MLH [1], and show that the new pipeline outperforms or closely matches state-of-the-art deep learning VPR methods based on high-dimensional features extracted from pre-trained deep convolutional neural networks.
{"title":"Compact Environment-Invariant Codes for Robust Visual Place Recognition","authors":"Unnat Jain, Vinay P. Namboodiri, Gaurav Pandey","doi":"10.1109/CRV.2017.22","DOIUrl":"https://doi.org/10.1109/CRV.2017.22","url":null,"abstract":"Robust visual place recognition (VPR) requires scene representations that are invariant to various environmental challenges such as seasonal changes and variations due to ambient lighting conditions during day and night. Moreover, a practical VPR system necessitates compact representations of environmental features. To satisfy these requirements, in this paper we suggest a modification to the existing pipeline of VPR systems to incorporate supervised hashing. The modified system learns (in a supervised setting) compact binary codes from image feature descriptors. These binary codes imbibe robustness to the visual variations exposed to it during the training phase, thereby, making the system adaptive to severe environmental changes. Also, incorporating supervised hashing makes VPR computationally more efficient and easy to implement on simple hardware. This is because binary embeddings can be learned over simple-to-compute features and the distance computation is also in the low dimensional hamming space of binary codes. We have performed experiments on several challenging data sets covering seasonal, illumination and viewpoint variations. We also compare two widely used supervised hashing methods of CCAITQ [1] and MLH [1] and show that this new pipeline out-performs or closely matches the state-of-the-art deep learning VPR methods that are based on high-dimensional features extracted from pre-trained deep convolutional neural networks.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115525178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Kinect-based pose estimation system is presented for the study of movement problems in a rehabilitation context. The performance of the system is compared against ground-truth data obtained with an expensive MoCap system. The results show that the proposed system performs well and could be used in a virtual rehabilitation setting, synchronized with other systems (e.g., robots).
{"title":"A Computer Vision System for Virtual Rehabilitation","authors":"Michael Bonenfant, D. Laurendeau, Alexis Fortin-Côté, P. Cardou, C. Gosselin, C. Faure, B. McFadyen, C. Mercier, L. Bouyer","doi":"10.1109/CRV.2017.30","DOIUrl":"https://doi.org/10.1109/CRV.2017.30","url":null,"abstract":"A Kinect-based pose estimation system is presented for the study of movement problems within a rehab context. The performance of the system is compared to ground-truth data obtained by an expensive MoCap system. The results show that the proposed system performs well and could be used within a virtual rehabilitation context synchronized with other systems (e.g., robots).","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114544759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel user interface for aiming and launching flying robots on user-defined trajectories. The method requires no user instrumentation and is easy to learn by analogy to a slingshot. With a few minutes of practice users can send robots along a desired 3D trajectory and place them in 3D space, including at high altitude and beyond line-of-sight. With the robot hovering in front of the user, the robot tracks the user's face to estimate its relative pose. The azimuth, elevation and distance of this pose control the parameters of the robot's subsequent trajectory. The user triggers the robot to fly the trajectory by making a distinct pre-trained facial expression. We propose three different trajectory types for different applications: straight-line, parabola, and circling. We also describe a simple training/startup interaction to select a trajectory type and train the aiming and triggering faces. In real-world experiments we demonstrate and evaluate the method. We also show that the face-recognition system is resistant to input from unauthorized users.
{"title":"Ready—Aim—Fly! Hands-Free Face-Based HRI for 3D Trajectory Control of UAVs","authors":"Jake Bruce, Jacob M. Perron, R. Vaughan","doi":"10.1109/CRV.2017.39","DOIUrl":"https://doi.org/10.1109/CRV.2017.39","url":null,"abstract":"We present a novel user interface for aiming andlaunching flying robots on user-defined trajectories. The methodrequires no user instrumentation and is easy to learn by analogyto a slingshot. With a few minutes of practice users can sendrobots along a desired 3D trajectory and place them in 3D space, including at high altitude and beyond line-of-sight. With the robot hovering in front of the user, the robot tracksthe user's face to estimate its relative pose. The azimuth, elevationand distance of this pose control the parameters of the robot'ssubsequent trajectory. The user triggers the robot to fly thetrajectory by making a distinct pre-trained facial expression. Wepropose three different trajectory types for different applications:straight-line, parabola, and circling. We also describe a simple training/startup interaction to selecta trajectory type and train the aiming and triggering faces. Inreal-world experiments we demonstrate and evaluate the method. We also show that the face-recognition system is resistant to inputfrom unauthorized users.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130876381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We develop an approach for unsupervised learning of associations between co-occurring perceptual events using a large graph, and apply it to successfully solve the image captcha of China's railroad system. The approach is based on the principle of suspicious coincidence, originally proposed by Barlow [1], who argued that the brain builds a statistical model of the world by learning associations between events that repeatedly co-occur. In this problem, a user is presented with a deformed picture of a Chinese phrase and eight low-resolution images, and must quickly select the relevant images in order to purchase train tickets. The problem presents several challenges: (1) teaching labels for the Chinese phrases and the images are not available for supervised learning, (2) no pre-trained deep convolutional neural networks exist for recognizing these Chinese phrases or the presented images, and (3) each captcha must be solved within a few seconds. We collected 2.6 million captchas, comprising 2.6 million deformed Chinese phrases and over 21 million images. From these data, we constructed an association graph of over 6 million vertices, linking vertices based on co-occurrence information and feature similarity between pairs of images. We then trained a deep convolutional neural network to learn a projection of the Chinese phrases onto a 230-dimensional latent space. Using label propagation, we computed the likelihood of each of the eight images conditioned on the latent-space projection of the deformed phrase for each captcha. The resulting system solves captchas with 77% accuracy in 2 seconds on average. In answering this practical challenge, our work illustrates the power of this class of unsupervised association learning techniques, which may be related to the brain's general strategy for associating language stimuli with visual objects on the principle of suspicious coincidence.
{"title":"Learning to Associate Words and Images Using a Large-Scale Graph","authors":"Heqing Ya, Haonan Sun, Jeffrey Helt, T. Lee","doi":"10.1109/CRV.2017.52","DOIUrl":"https://doi.org/10.1109/CRV.2017.52","url":null,"abstract":"We develop an approach for unsupervised learning of associations between co-occurring perceptual events using a large graph. We applied this approach to successfully solve the image captcha of China's railroad system. The approach is based on the principle of suspicious coincidence, originally proposed by Barlow [1], who argued that the brain builds a statistical model of the world by learning associations between events that repeatedly co-occur. In this particular problem, a user is presented with a deformed picture of a Chinese phrase and eight low-resolution images. They must quickly select the relevant images in order to purchase their train tickets. This problem presents several challenges: (1) the teaching labels for both the Chinese phrases and the images were not available for supervised learning, (2) no pre-trained deep convolutional neural networks are available for recognizing these Chinese phrases or the presented images, and (3) each captcha must be solved within a few seconds. We collected 2.6 million captchas, with 2.6 million deformed Chinese phrases and over 21 million images. From these data, we constructed an association graph, composed of over 6 million vertices, and linked these vertices based on co-occurrence information and feature similarity between pairs of images. We then trained a deep convolutional neural network to learn a projection of the Chinese phrases onto a 230- dimensional latent space. Using label propagation, we computed the likelihood of each of the eight images conditioned on the latent space projection of the deformed phrase for each captcha. The resulting system solved captchas with 77% accuracy in 2 seconds on average. Our work, in answering this practical challenge, illustrates the power of this class of unsupervised association learning techniques, which may be related to the brain's general strategy for associating language stimuli with visual objects on the principle of suspicious coincidence.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130242333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recurrent feedback connections in the mammalian visual system have been hypothesized to play a role in synthesizing input within the theoretical framework of analysis by synthesis. Comparing the internally synthesized representation with the input provides a validation mechanism during perceptual inference and learning. Inspired by these ideas, we propose that the synthesis machinery can compose new, unobserved images by imagination to train the network itself, increasing the robustness of the system in novel scenarios. As a proof of concept, we investigate whether images composed by imagination can help an object recognition system deal with occlusion, which remains challenging for current state-of-the-art deep convolutional neural networks. We fine-tuned a network on images containing objects in various occlusion scenarios that are imagined, or self-generated, through a deep generator network. Trained on imagined occlusion scenarios under the object-persistence constraint, our network discovered more subtle and localized image features that the original network neglected for object classification, obtaining better separability of the object classes in feature space. This leads to a significant improvement in object recognition under occlusion relative to the original network trained only on un-occluded images. Beyond the practical benefits for recognition under occlusion, this work demonstrates that self-generated composition of visual scenes through the synthesis loop, combined with the object-persistence constraint, can give neural networks the opportunity to discover new relevant patterns in the data and become more flexible in dealing with novel situations.
{"title":"Learning Robust Object Recognition Using Composed Scenes from Generative Models","authors":"Hao Wang, Xingyu Lin, Yimeng Zhang, T. Lee","doi":"10.1109/CRV.2017.42","DOIUrl":"https://doi.org/10.1109/CRV.2017.42","url":null,"abstract":"Recurrent feedback connections in the mammalian visual system have been hypothesized to play a role in synthesizing input in the theoretical framework of analysis by synthesis. The comparison of internally synthesized representation with that of the input provides a validation mechanism during perceptual inference and learning. Inspired by these ideas, we proposed that the synthesis machinery can compose new, unobserved images by imagination to train the network itself so as to increase the robustness of the system in novel scenarios. As a proof of concept, we investigated whether images composed by imagination could help an object recognition system to deal with occlusion, which is challenging for the current state-of-the-art deep convolutional neural networks. We fine-tuned a network on images containing objects in various occlusion scenarios, that are imagined or self-generated through a deep generator network. Trained on imagined occluded scenarios under the object persistence constraint, our network discovered more subtle and localized image features that were neglected by the original network for object classification, obtaining better separability of different object classes in the feature space. This leads to significant improvement of object recognition under occlusion for our network relative to the original network trained only on un-occluded images. In addition to providing practical benefits in object recognition under occlusion, this work demonstrates the use of self-generated composition of visual scenes through the synthesis loop, combined with the object persistence constraint, can provide opportunities for neural networks to discover new relevant patterns in the data, and become more flexible in dealing with novel situations.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122330857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The use of different priors (e.g., shape and appearance) has proven critical for robust image segmentation of different types of target objects. Many existing methods for extracting trees (e.g., vascular or airway trees) from medical images leverage appearance priors (e.g., tubular-ness and bifurcation-ness) and knowledge of the cross-sectional geometry (e.g., circles or ellipses) of the tree-forming tubes. In this work, we present the first method for 3D tree extraction from 3D medical images (e.g., CT or MRI) that, in addition to appearance and cross-sectional geometry priors, utilizes prior tree statistics collected from training data. Our method collects and leverages topological tree priors and geometrical statistics, including tree hierarchy, branch angle, and branch length statistics. Our implementation takes the form of a Bayesian tree centerline tracking method that combines the aforementioned tree priors with the observed image data. We evaluated the method on both synthetic 3D datasets and real clinical chest CT datasets. For synthetic data, incorporating tree priors resulted in at least a 13% increase in correctly detected branches across different noise levels. For real clinical scans, the mean distance from ground-truth centerlines to the detected centerlines improved by 12% when tree priors were used. Both experiments validate that, by incorporating tree statistics, our tree extraction method becomes more robust to noise and provides more accurate branch localization.
{"title":"Leveraging Tree Statistics for Extracting Anatomical Trees from 3D Medical Images","authors":"Mengliu Zhao, Brandon Miles, G. Hamarneh","doi":"10.1109/CRV.2017.15","DOIUrl":"https://doi.org/10.1109/CRV.2017.15","url":null,"abstract":"Using different priors (e.g. shape and appearance) have proven critical for robust image segmentation of different types of target objects. Many existing methods for extracting trees (e.g. vascular or airway trees) from medical images have leveraged appearance priors (e.g. tubular-ness and bifurcationness) and the knowledge of the cross-sectional geometry (e.g. circles or ellipses) of the tree-forming tubes. In this work, we present the first method for 3D tree extraction from 3D medical images (e.g. CT or MRI) that, in addition to appearance and cross-sectional geometry priors, utilizes prior tree statistics collected from the training data. Our tree extraction method collects and leverages topological tree prior and geometrical statistics, including tree hierarchy, branch angle and length statistics. Our implementation takes the form of a Bayesian tree centerline tracking method combining the aforementioned tree priors with observed image data. We evaluated our method on both synthetic 3D datasets and real clinical CT chest datasets. For synthetic data, our method's key feature of incorporating tree priors resulted in at least 13% increase in correctly detected branches under different noise levels. For real clinical scans, the mean distance from ground truth centerlines to the detected centerlines by our method was improved by 12% when utilizing tree priors. Both experiments validate that, by incorporating tree statistics, our tree extraction method becomes more robust to noise and provides more accurate branch localization.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123548186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}