Saeid Motiian, Marco Piccirilli, D. Adjeroh, Gianfranco Doretto
We explore the visual recognition problem from a main data view when an auxiliary data view is available during training. This is important because it allows improving the training of visual classifiers when paired additional data is cheaply available, and it improves the recognition from multi-view data when there is a missing view at testing time. The problem is challenging because of the intrinsic asymmetry caused by the missing auxiliary view during testing. We account for this view during training by extending the information bottleneck method and combining it with risk minimization. In this way, we establish an information-theoretic principle for learning any type of visual classifier under this particular setting. We use this principle to design a large-margin classifier with an efficient optimization in the primal space. We extensively compare our method with the state of the art on different visual recognition datasets and with different types of auxiliary data, and show that the proposed framework has very promising potential.
{"title":"Information Bottleneck Learning Using Privileged Information for Visual Recognition","authors":"Saeid Motiian, Marco Piccirilli, D. Adjeroh, Gianfranco Doretto","doi":"10.1109/CVPR.2016.166","DOIUrl":"https://doi.org/10.1109/CVPR.2016.166","url":null,"abstract":"We explore the visual recognition problem from a main data view when an auxiliary data view is available during training. This is important because it allows improving the training of visual classifiers when paired additional data is cheaply available, and it improves the recognition from multi-view data when there is a missing view at testing time. The problem is challenging because of the intrinsic asymmetry caused by the missing auxiliary view during testing. We account for such view during training by extending the information bottleneck method, and by combining it with risk minimization. In this way, we establish an information theoretic principle for leaning any type of visual classifier under this particular setting. We use this principle to design a large-margin classifier with an efficient optimization in the primal space. We extensively compare our method with the state-of-the-art on different visual recognition datasets, and with different types of auxiliary data, and show that the proposed framework has a very promising potential.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"11 1","pages":"1496-1505"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87573115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. H. Rezatofighi, Anton Milan, Zhen Zhang, Javen Qinfeng Shi, A. Dick, I. Reid
Matching between two sets of objects is typically approached by finding the object pairs that collectively maximize the joint matching score. In this paper, we argue that this single solution does not necessarily lead to the optimal matching accuracy and that general one-to-one assignment problems can be improved by considering multiple hypotheses before computing the final similarity measure. To that end, we propose to utilize the marginal distributions for each entity. Previously, this idea has been neglected mainly because exact marginalization is intractable due to a combinatorial number of all possible matching permutations. Here, we propose a generic approach to efficiently approximate the marginal distributions by exploiting the m-best solutions of the original problem. This approach not only improves the matching solution, but also provides more accurate ranking of the results, because of the extra information included in the marginal distribution. We validate our claim on two distinct objectives: (i) person re-identification and temporal matching modeled as an integer linear program, and (ii) feature point matching using a quadratic cost function. Our experiments confirm that marginalization indeed leads to superior performance compared to the single (nearly) optimal solution, yielding state-of-the-art results in both applications on standard benchmarks.
{"title":"Joint Probabilistic Matching Using m-Best Solutions","authors":"S. H. Rezatofighi, Anton Milan, Zhen Zhang, Javen Qinfeng Shi, A. Dick, I. Reid","doi":"10.1109/CVPR.2016.22","DOIUrl":"https://doi.org/10.1109/CVPR.2016.22","url":null,"abstract":"Matching between two sets of objects is typically approached by finding the object pairs that collectively maximize the joint matching score. In this paper, we argue that this single solution does not necessarily lead to the optimal matching accuracy and that general one-to-one assignment problems can be improved by considering multiple hypotheses before computing the final similarity measure. To that end, we propose to utilize the marginal distributions for each entity. Previously, this idea has been neglected mainly because exact marginalization is intractable due to a combinatorial number of all possible matching permutations. Here, we propose a generic approach to efficiently approximate the marginal distributions by exploiting the m-best solutions of the original problem. This approach not only improves the matching solution, but also provides more accurate ranking of the results, because of the extra information included in the marginal distribution. We validate our claim on two distinct objectives: (i) person re-identification and temporal matching modeled as an integer linear program, and (ii) feature point matching using a quadratic cost function. Our experiments confirm that marginalization indeed leads to superior performance compared to the single (nearly) optimal solution, yielding state-of-the-art results in both applications on standard benchmarks.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"88 1","pages":"136-145"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88965242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Sharmanska, D. Hernández-Lobato, José Miguel Hernández-Lobato, Novi Quadrianto
Imagine we show an image to a person and ask them to decide whether the scene in the image is warm or not warm, and whether it is easy or not to spot a squirrel in the image. For exactly the same image, the answers to those questions are likely to differ from person to person. This is because the task is inherently ambiguous. Such an ambiguous, and therefore challenging, task pushes the boundary of computer vision in showing what can and cannot be learned from visual data. Crowdsourcing has been invaluable for collecting annotations. This is particularly so for a task that goes beyond a clear-cut dichotomy, as multiple human judgments per image are needed to reach a consensus. This paper makes conceptual and technical contributions. On the conceptual side, we define disagreements among annotators as privileged information about the data instance. On the technical side, we propose a framework to incorporate annotation disagreements into the classifiers. The proposed framework is simple, relatively fast, and outperforms classifiers that do not take the disagreements into account, especially when tested on high-confidence annotations.
{"title":"Ambiguity Helps: Classification with Disagreements in Crowdsourced Annotations","authors":"V. Sharmanska, D. Hernández-Lobato, José Miguel Hernández-Lobato, Novi Quadrianto","doi":"10.1109/CVPR.2016.241","DOIUrl":"https://doi.org/10.1109/CVPR.2016.241","url":null,"abstract":"Imagine we show an image to a person and ask her/him to decide whether the scene in the image is warm or not warm, and whether it is easy or not to spot a squirrel in the image. For exactly the same image, the answers to those questions are likely to differ from person to person. This is because the task is inherently ambiguous. Such an ambiguous, therefore challenging, task is pushing the boundary of computer vision in showing what can and can not be learned from visual data. Crowdsourcing has been invaluable for collecting annotations. This is particularly so for a task that goes beyond a clear-cut dichotomy as multiple human judgments per image are needed to reach a consensus. This paper makes conceptual and technical contributions. On the conceptual side, we define disagreements among annotators as privileged information about the data instance. On the technical side, we propose a framework to incorporate annotation disagreements into the classifiers. The proposed framework is simple, relatively fast, and outperforms classifiers that do not take into account the disagreements, especially if tested on high confidence annotations.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"52 1","pages":"2194-2202"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88977735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Renjiao Yi, Jue Wang, P. Tan
We present a fully automatic approach to detect and segment fence-like occluders from a video clip. Unlike previous approaches that usually assume either static scenes or cameras, our method is capable of handling both dynamic scenes and moving cameras. Under a bottom-up framework, it first clusters pixels into coherent groups using color and motion features. These pixel groups are then analyzed in a fully connected graph and labeled as either fence or non-fence using graph-cut optimization. Finally, we solve a dense Conditional Random Field (CRF) constructed from multiple frames to enhance both spatial accuracy and temporal coherence of the segmentation. Once segmented, one can use existing hole-filling methods to generate a fence-free output. Extensive evaluation suggests that our method outperforms previous automatic and interactive approaches on complex examples captured by mobile devices.
{"title":"Automatic Fence Segmentation in Videos of Dynamic Scenes","authors":"Renjiao Yi, Jue Wang, P. Tan","doi":"10.1109/CVPR.2016.83","DOIUrl":"https://doi.org/10.1109/CVPR.2016.83","url":null,"abstract":"We present a fully automatic approach to detect and segment fence-like occluders from a video clip. Unlike previous approaches that usually assume either static scenes or cameras, our method is capable of handling both dynamic scenes and moving cameras. Under a bottom-up framework, it first clusters pixels into coherent groups using color and motion features. These pixel groups are then analyzed in a fully connected graph, and labeled as either fence or non-fence using graph-cut optimization. Finally, we solve a dense Conditional Random Filed (CRF) constructed from multiple frames to enhance both spatial accuracy and temporal coherence of the segmentation. Once segmented, one can use existing hole-filling methods to generate a fencefree output. Extensive evaluation suggests that our method outperforms previous automatic and interactive approaches on complex examples captured by mobile devices.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"120 1","pages":"705-713"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89127299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Kulkarni, Suhas Lohit, P. Turaga, Ronan Kerviche, A. Ashok
The goal of this paper is to present a non-iterative and, more importantly, extremely fast algorithm to reconstruct images from compressively sensed (CS) random measurements. To this end, we propose a novel convolutional neural network (CNN) architecture which takes in CS measurements of an image as input and outputs an intermediate reconstruction. We call this network ReconNet. The intermediate reconstruction is fed into an off-the-shelf denoiser to obtain the final reconstructed image. On a standard dataset of images we show significant improvements in reconstruction results (both in terms of PSNR and time complexity) over state-of-the-art iterative CS reconstruction algorithms at various measurement rates. Further, through qualitative experiments on real data collected using our block single pixel camera (SPC), we show that our network is highly robust to sensor noise and can recover visually better quality images than competitive algorithms at extremely low sensing rates of 0.1 and 0.04. To demonstrate that our algorithm can recover semantically informative images even at a low measurement rate of 0.01, we present a very robust proof-of-concept real-time visual tracking application.
{"title":"ReconNet: Non-Iterative Reconstruction of Images from Compressively Sensed Measurements","authors":"K. Kulkarni, Suhas Lohit, P. Turaga, Ronan Kerviche, A. Ashok","doi":"10.1109/CVPR.2016.55","DOIUrl":"https://doi.org/10.1109/CVPR.2016.55","url":null,"abstract":"The goal of this paper is to present a non-iterative and more importantly an extremely fast algorithm to reconstruct images from compressively sensed (CS) random measurements. To this end, we propose a novel convolutional neural network (CNN) architecture which takes in CS measurements of an image as input and outputs an intermediate reconstruction. We call this network, ReconNet. The intermediate reconstruction is fed into an off-the-shelf denoiser to obtain the final reconstructed image. On a standard dataset of images we show significant improvements in reconstruction results (both in terms of PSNR and time complexity) over state-of-the-art iterative CS reconstruction algorithms at various measurement rates. Further, through qualitative experiments on real data collected using our block single pixel camera (SPC), we show that our network is highly robust to sensor noise and can recover visually better quality images than competitive algorithms at extremely low sensing rates of 0.1 and 0.04. To demonstrate that our algorithm can recover semantically informative images even at a low measurement rate of 0.01, we present a very robust proof of concept real-time visual tracking application.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"16 1","pages":"449-458"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79655287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Aggarwal, Amrisha Vohra, A. Namboodiri
We present a practical solution for generating 360° stereo panoramic videos using a single camera. Current approaches either use a moving camera that captures multiple images of a scene, which are then stitched together to form the final panorama, or use multiple cameras that are synchronized. A moving camera limits the solution to static scenes, while multi-camera solutions require dedicated calibrated setups. Our approach improves upon the existing solutions in two significant ways: It solves the problem using a single camera, thus minimizing the calibration problem and giving us the ability to convert any digital camera into a panoramic stereo capture device. It captures all the light rays required for stereo panoramas in a single frame using a compact, custom-designed mirror, thus making the design practical to manufacture and easier to use. We analyze several properties of the design and present panoramic stereo and depth estimation results.
{"title":"Panoramic Stereo Videos with a Single Camera","authors":"R. Aggarwal, Amrisha Vohra, A. Namboodiri","doi":"10.1109/CVPR.2016.408","DOIUrl":"https://doi.org/10.1109/CVPR.2016.408","url":null,"abstract":"We present a practical solution for generating 360° stereo panoramic videos using a single camera. Current approaches either use a moving camera that captures multiple images of a scene, which are then stitched together to form the final panorama, or use multiple cameras that are synchronized. A moving camera limits the solution to static scenes, while multi-camera solutions require dedicated calibrated setups. Our approach improves upon the existing solutions in two significant ways: It solves the problem using a single camera, thus minimizing the calibration problem and providing us the ability to convert any digital camera into a panoramic stereo capture device. It captures all the light rays required for stereo panoramas in a single frame using a compact custom designed mirror, thus making the design practical to manufacture and easier to use. We analyze several properties of the design as well as present panoramic stereo and depth estimation results.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"21 1","pages":"3755-3763"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89448346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrey Bushnevskiy, L. Sorgi, B. Rosenhahn
Multicamera rigs are used in a large number of 3D vision applications, such as 3D modeling, motion capture or telepresence, and a robust calibration is of utmost importance in order to achieve high-accuracy results. In many practical configurations the cameras in a rig are arranged in such a way that they can observe each other; in other words, a number of epipoles correspond to real image points. In this paper we propose a solution for the automatic recovery of the external calibration of a multicamera system by enforcing only simple geometrical constraints arising from the epipole visibility, without using any calibration object such as checkerboards, laser pointers or similar. Additionally, we introduce an extension of the method that handles the case of epipoles being visible in the reflection of a planar mirror, which makes the algorithm suitable for the calibration of any multicamera system, irrespective of the number of cameras and their actual mutual visibility. Furthermore, the method requires only one or a few images per camera, and therefore offers high speed and usability. We provide evidence of the algorithm's effectiveness through a wide set of tests performed on synthetic as well as real datasets, and we compare the results with those obtained using a traditional LED-based algorithm. The real datasets have been captured using a multicamera Virtual Reality (VR) rig and a spherical dome configuration for 3D reconstruction.
{"title":"Multicamera Calibration from Visible and Mirrored Epipoles","authors":"Andrey Bushnevskiy, L. Sorgi, B. Rosenhahn","doi":"10.1109/CVPR.2016.367","DOIUrl":"https://doi.org/10.1109/CVPR.2016.367","url":null,"abstract":"Multicamera rigs are used in a large number of 3D Vision applications, such as 3D modeling, motion capture or telepresence and a robust calibration is of utmost importance in order to achieve a high accuracy results. In many practical configurations the cameras in a rig are arranged in such a way, that they can observe each other, in other words a number of epipoles correspond to the real image points. In this paper we propose a solution for the automatic recovery of the external calibration of a multicamera system by enforcing only simple geometrical constraints, arising from the epipole visibility, without using any calibration object, such as checkerboards, laser pointers or similar. Additionally, we introduce an extension of the method that handles the case of epipoles being visible in the reflection of a planar mirror, which makes the algorithm suitable for the calibration of any multicamera system, irrespective of the number of cameras and their actual mutual visibility, and furthermore we remark that it requires only one or a few images per camera and therefore features a high speed and usability. We produce an evidence of the algorithm effectiveness by presenting a wide set of tests performed on synthetic as well as real datasets and we compare the results with those obtained using a traditional LED-based algorithm. The real datasets have been captured using a multicamera Virtual Reality (VR) rig and a spherical dome configuration for 3D reconstruction.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"8 Suppl 10 1","pages":"3373-3381"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89672979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bharat Singh, Tim K. Marks, Michael J. Jones, Oncel Tuzel, Ming Shao
We present a multi-stream bi-directional recurrent neural network for fine-grained action detection. Recently, two-stream convolutional neural networks (CNNs) trained on stacked optical flow and image frames have been successful for action recognition in videos. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion and also suppresses background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. Our motion streams use pixel trajectories of a frame as raw features, in which the displacement values corresponding to a moving scene point are at the same spatial position across several frames. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label. We test on two action detection datasets: the MPII Cooking 2 Dataset, and a new MERL Shopping Dataset that we introduce and make available to the community with this paper. The results demonstrate that our method significantly outperforms state-of-the-art action detection methods on both datasets.
{"title":"A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection","authors":"Bharat Singh, Tim K. Marks, Michael J. Jones, Oncel Tuzel, Ming Shao","doi":"10.1109/CVPR.2016.216","DOIUrl":"https://doi.org/10.1109/CVPR.2016.216","url":null,"abstract":"We present a multi-stream bi-directional recurrent neural network for fine-grained action detection. Recently, twostream convolutional neural networks (CNNs) trained on stacked optical flow and image frames have been successful for action recognition in videos. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion and also suppresses background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. Our motion streams use pixel trajectories of a frame as raw features, in which the displacement values corresponding to a moving scene point are at the same spatial position across several frames. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label. We test on two action detection datasets: the MPII Cooking 2 Dataset, and a new MERL Shopping Dataset that we introduce and make available to the community with this paper. The results demonstrate that our method significantly outperforms state-of-the-art action detection methods on both datasets.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"20 1","pages":"1961-1970"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90428901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhe Zhu, Dun Liang, Song-Hai Zhang, Xiaolei Huang, Baoli Li, Shimin Hu
Although promising results have been achieved in the areas of traffic-sign detection and classification, few works have provided simultaneous solutions to these two tasks for realistic real-world images. We make two contributions to this problem. Firstly, we have created a large traffic-sign benchmark from 100000 Tencent Street View panoramas, going beyond previous benchmarks. It provides 100000 images containing 30000 traffic-sign instances. These images cover large variations in illuminance and weather conditions. Each traffic-sign in the benchmark is annotated with a class label, its bounding box and pixel mask. We call this benchmark Tsinghua-Tencent 100K. Secondly, we demonstrate how a robust end-to-end convolutional neural network (CNN) can simultaneously detect and classify traffic-signs. Most previous CNN image processing solutions target objects that occupy a large proportion of an image, and such networks do not work well for target objects that occupy only a small fraction of an image, like the traffic-signs here. Experimental results show the robustness of our network and its superiority to alternatives. The benchmark, source code and the CNN model introduced in this paper are publicly available.
{"title":"Traffic-Sign Detection and Classification in the Wild","authors":"Zhe Zhu, Dun Liang, Song-Hai Zhang, Xiaolei Huang, Baoli Li, Shimin Hu","doi":"10.1109/CVPR.2016.232","DOIUrl":"https://doi.org/10.1109/CVPR.2016.232","url":null,"abstract":"Although promising results have been achieved in the areas of traffic-sign detection and classification, few works have provided simultaneous solutions to these two tasks for realistic real world images. We make two contributions to this problem. Firstly, we have created a large traffic-sign benchmark from 100000 Tencent Street View panoramas, going beyond previous benchmarks. It provides 100000 images containing 30000 traffic-sign instances. These images cover large variations in illuminance and weather conditions. Each traffic-sign in the benchmark is annotated with a class label, its bounding box and pixel mask. We call this benchmark Tsinghua-Tencent 100K. Secondly, we demonstrate how a robust end-to-end convolutional neural network (CNN) can simultaneously detect and classify trafficsigns. Most previous CNN image processing solutions target objects that occupy a large proportion of an image, and such networks do not work well for target objects occupying only a small fraction of an image like the traffic-signs here. Experimental results show the robustness of our network and its superiority to alternatives. The benchmark, source code and the CNN model introduced in this paper is publicly available1.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"89 1","pages":"2110-2118"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75223118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arik Poznanski, Lior Wolf
Given an image of a handwritten word, a CNN is employed to estimate its n-gram frequency profile, which is the set of n-grams contained in the word. Frequencies for unigrams, bigrams and trigrams are estimated for the entire word and for parts of it. Canonical Correlation Analysis is then used to match the estimated profile to the true profiles of all words in a large dictionary. The CNN employs several novelties, such as the use of multiple fully connected branches. Applied to all commonly used handwriting recognition benchmarks, our method outperforms all existing methods by a very large margin.
{"title":"CNN-N-Gram for HandwritingWord Recognition","authors":"Arik Poznanski, Lior Wolf","doi":"10.1109/CVPR.2016.253","DOIUrl":"https://doi.org/10.1109/CVPR.2016.253","url":null,"abstract":"Given an image of a handwritten word, a CNN is employed to estimate its n-gram frequency profile, which is the set of n-grams contained in the word. Frequencies for unigrams, bigrams and trigrams are estimated for the entire word and for parts of it. Canonical Correlation Analysis is then used to match the estimated profile to the true profiles of all words in a large dictionary. The CNN that is used employs several novelties such as the use of multiple fully connected branches. Applied to all commonly used handwriting recognition benchmarks, our method outperforms, by a very large margin, all existing methods.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"24 1","pages":"2305-2314"},"PeriodicalIF":0.0,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75329797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}