
Latest publications from the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Predicting Behaviors of Basketball Players from First Person Videos
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.133
Shan Su, J. Hong, Jianbo Shi, H. Park
This paper presents a method to predict the future movements (location and gaze direction) of basketball players as a whole from their first person videos. The predicted behaviors reflect an individual's physical space that affords taking the next actions while conforming to social behaviors by engaging in joint attention. Our key innovation is to use the 3D reconstruction of multiple first person cameras to automatically annotate each other's visual semantics of social configurations. We leverage two learning signals uniquely embedded in first person videos. Individually, a first person video records the visual semantics of the spatial and social layout around a person, which allows associating with past similar situations. Collectively, first person videos follow joint attention that can link the individuals to a group. We learn the egocentric visual semantics of group movements using a Siamese neural network to retrieve future trajectories. We consolidate the retrieved trajectories from all players by maximizing a measure of social compatibility: the gaze alignment towards joint attention predicted by their social formation, where the dynamics of joint attention are learned by a long-term recurrent convolutional network. This allows us to characterize which social configuration is more plausible and predict future group trajectories.
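To make the retrieval step above concrete, here is a minimal sketch, not the authors' implementation: all dimensions, the two-layer shared embedding, and the cosine-similarity retrieval are assumptions. It shows how a Siamese-style embedding could be used to look up the future trajectory of the most similar past situation.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_embed(x, W1, W2):
    """Shared branch of a Siamese network: two linear layers with ReLU."""
    h = np.maximum(x @ W1, 0.0)
    return h @ W2

# Hypothetical dimensions: 256-d egocentric visual features, 32-d embedding.
W1 = rng.normal(scale=0.1, size=(256, 64))
W2 = rng.normal(scale=0.1, size=(64, 32))

# A database of past situations: features plus the trajectory that followed each.
db_feats = rng.normal(size=(500, 256))
db_future_traj = rng.normal(size=(500, 10, 2))   # 10 future (x, y) positions each

query_feat = rng.normal(size=(1, 256))

# Retrieve the most similar past situation in embedding space (cosine similarity)
# and reuse its future trajectory as the prediction for the query player.
q = shared_embed(query_feat, W1, W2)
d = shared_embed(db_feats, W1, W2)
sim = (d @ q.T).ravel() / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-8)
predicted_traj = db_future_traj[np.argmax(sim)]
print(predicted_traj.shape)   # (10, 2)
```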
{"title":"Predicting Behaviors of Basketball Players from First Person Videos","authors":"Shan Su, J. Hong, Jianbo Shi, H. Park","doi":"10.1109/CVPR.2017.133","DOIUrl":"https://doi.org/10.1109/CVPR.2017.133","url":null,"abstract":"This paper presents a method to predict the future movements (location and gaze direction) of basketball players as a whole from their first person videos. The predicted behaviors reflect an individual physical space that affords to take the next actions while conforming to social behaviors by engaging to joint attention. Our key innovation is to use the 3D reconstruction of multiple first person cameras to automatically annotate each others visual semantics of social configurations. We leverage two learning signals uniquely embedded in first person videos. Individually, a first person video records the visual semantics of a spatial and social layout around a person that allows associating with past similar situations. Collectively, first person videos follow joint attention that can link the individuals to a group. We learn the egocentric visual semantics of group movements using a Siamese neural network to retrieve future trajectories. We consolidate the retrieved trajectories from all players by maximizing a measure of social compatibility—the gaze alignment towards joint attention predicted by their social formation, where the dynamics of joint attention is learned by a long-term recurrent convolutional network. This allows us to characterize which social configuration is more plausible and predict future group trajectories.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"60 1","pages":"1206-1215"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76682009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
Kernel Pooling for Convolutional Neural Networks
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.325
Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, Serge J. Belongie
Convolutional Neural Networks (CNNs) with Bilinear Pooling, initially in their full form and later using compact representations, have yielded impressive performance gains on a wide range of visual tasks, including fine-grained visual categorization, visual question answering, face recognition, and description of texture and style. The key to their success lies in the spatially invariant modeling of pairwise (2nd order) feature interactions. In this work, we propose a general pooling framework that captures higher order interactions of features in the form of kernels. We demonstrate how to approximate kernels such as Gaussian RBF up to a given order using compact explicit feature maps in a parameter-free manner. Combined with CNNs, the composition of the kernel can be learned from data in an end-to-end fashion via error back-propagation. The proposed kernel pooling scheme is evaluated in terms of both kernel approximation error and visual recognition accuracy. Experimental evaluations demonstrate state-of-the-art performance on commonly used fine-grained recognition datasets.
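As a rough illustration of approximating a kernel up to a given order with compact explicit feature maps, the sketch below uses Tensor Sketch (Count Sketches combined in the frequency domain) to estimate the low-order terms of a Taylor-expanded kernel. The sketch dimension, order, and coefficients are placeholder values, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch(x, h, s, d):
    """Count Sketch projection of x into d dims using hash h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def tensor_sketch(x, order, d, hashes, signs):
    """Approximate the order-p outer-product feature map of x in d dims by
    multiplying Count Sketches in the frequency domain (Tensor Sketch)."""
    f = np.ones(d, dtype=complex)
    for p in range(order):
        f *= np.fft.fft(count_sketch(x, hashes[p], signs[p], d))
    return np.real(np.fft.ifft(f))

dim, d = 64, 512          # input feature dim and sketch dim (assumed values)
order = 3                 # keep kernel terms up to 3rd order
hashes = [rng.integers(0, d, size=dim) for _ in range(order)]
signs = [rng.choice([-1.0, 1.0], size=dim) for _ in range(order)]

x, y = rng.normal(size=dim), rng.normal(size=dim)
# Truncated Taylor view of an RBF-like kernel: sum_p alpha_p * (x . y)^p,
# with each (x . y)^p estimated as a dot product of order-p tensor sketches.
alphas = [1.0, 1.0, 0.5, 1.0 / 6.0]   # placeholder coefficients
approx, exact = alphas[0], alphas[0]
for p in range(1, order + 1):
    approx += alphas[p] * np.dot(tensor_sketch(x, p, d, hashes, signs),
                                 tensor_sketch(y, p, d, hashes, signs))
    exact += alphas[p] * np.dot(x, y) ** p
print(approx, exact)   # the sketch estimate should be close to the exact value
```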
{"title":"Kernel Pooling for Convolutional Neural Networks","authors":"Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, Serge J. Belongie","doi":"10.1109/CVPR.2017.325","DOIUrl":"https://doi.org/10.1109/CVPR.2017.325","url":null,"abstract":"Convolutional Neural Networks (CNNs) with Bilinear Pooling, initially in their full form and later using compact representations, have yielded impressive performance gains on a wide range of visual tasks, including fine-grained visual categorization, visual question answering, face recognition, and description of texture and style. The key to their success lies in the spatially invariant modeling of pairwise (2nd order) feature interactions. In this work, we propose a general pooling framework that captures higher order interactions of features in the form of kernels. We demonstrate how to approximate kernels such as Gaussian RBF up to a given order using compact explicit feature maps in a parameter-free manner. Combined with CNNs, the composition of the kernel can be learned from data in an end-to-end fashion via error back-propagation. The proposed kernel pooling scheme is evaluated in terms of both kernel approximation error and visual recognition accuracy. Experimental evaluations demonstrate state-of-the-art performance on commonly used fine-grained recognition datasets.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"36 1","pages":"3049-3058"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78971297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 278
WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.631
Thibaut Durand, Taylor Mordan, Nicolas Thome, M. Cord
This paper introduces WILDCAT, a deep learning method which jointly aims at aligning image regions for gaining spatial invariance and learning strongly localized features. Our model is trained using only global image labels and is devoted to three main visual recognition tasks: image classification, weakly supervised object localization and semantic segmentation. WILDCAT extends state-of-the-art Convolutional Neural Networks at three main levels: the use of Fully Convolutional Networks for maintaining spatial resolution, the explicit design in the network of local features related to different class modalities, and a new way to pool these features to provide a global image prediction required for weakly supervised training. Extensive experiments show that our model significantly outperforms state-of-the-art methods.
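The pooling idea, combining the strongest positive and negative local evidence per class map into an image-level score, can be sketched as follows. This is only in the spirit of the WILDCAT pooling; k_max, k_min and alpha are assumed values.

```python
import numpy as np

def wildcat_style_pool(class_maps, k_max=3, k_min=3, alpha=0.7):
    """Pool per-class spatial score maps into image-level scores by averaging
    the k highest and the (weighted) k lowest responses, mixing positive and
    negative local evidence (parameter values are assumed)."""
    C, H, W = class_maps.shape
    flat_sorted = np.sort(class_maps.reshape(C, H * W), axis=1)
    top = flat_sorted[:, -k_max:].mean(axis=1)
    bottom = flat_sorted[:, :k_min].mean(axis=1)
    return top + alpha * bottom

rng = np.random.default_rng(0)
maps = rng.normal(size=(20, 14, 14))   # 20 classes, 14x14 fully convolutional output
print(wildcat_style_pool(maps))        # one weakly supervised score per class
```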
{"title":"WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation","authors":"Thibaut Durand, Taylor Mordan, Nicolas Thome, M. Cord","doi":"10.1109/CVPR.2017.631","DOIUrl":"https://doi.org/10.1109/CVPR.2017.631","url":null,"abstract":"This paper introduces WILDCAT, a deep learning method which jointly aims at aligning image regions for gaining spatial invariance and learning strongly localized features. Our model is trained using only global image labels and is devoted to three main visual recognition tasks: image classification, weakly supervised object localization and semantic segmentation. WILDCAT extends state-of-the-art Convolutional Neural Networks at three main levels: the use of Fully Convolutional Networks for maintaining spatial resolution, the explicit design in the network of local features related to different class modalities, and a new way to pool these features to provide a global image prediction required for weakly supervised training. Extensive experiments show that our model significantly outperforms state-of-the-art methods.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"16 1","pages":"5957-5966"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86145513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 300
Cross-Modality Binary Code Learning via Fusion Similarity Hashing
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.672
Hong Liu, R. Ji, Yongjian Wu, Feiyue Huang, Baochang Zhang
Binary code learning has recently become an emerging topic in large-scale cross-modality retrieval. It aims to map features from multiple modalities into a common Hamming space, where the cross-modality similarity can be approximated efficiently via Hamming distance. To this end, most existing works learn binary codes directly from data instances in multiple modalities, preserving intra- and inter-modal similarities respectively. Few methods consider preserving the fusion similarity among multi-modal instances instead, which can explicitly capture their heterogeneous correlation in cross-modality retrieval. In this paper, we propose a hashing scheme, termed Fusion Similarity Hashing (FSH), which explicitly embeds the graph-based fusion similarity across modalities into a common Hamming space. Inspired by fusion by diffusion, our core idea is to construct an undirected asymmetric graph to model the fusion similarity among different modalities, upon which a graph hashing scheme with alternating optimization is introduced to learn binary codes that embed such fusion similarity. Quantitative evaluations on three widely used benchmarks, i.e., UCI Handwritten Digit, MIR-Flickr25K and NUS-WIDE, demonstrate that the proposed FSH approach achieves superior performance over state-of-the-art methods.
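A minimal sketch of the overall pipeline is given below: build per-modality affinity graphs, fuse them, and sign a spectral embedding to obtain binary codes. The simple averaging fusion and the spectral relaxation are crude stand-ins for the diffusion-based fusion and the alternating optimization in FSH; all sizes and the Gaussian kernel choice are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_similarity(feats, k=5):
    """Symmetric k-NN affinity graph for one modality (Gaussian weights assumed)."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (np.median(d2) + 1e-8))
    np.fill_diagonal(w, 0.0)
    keep = np.argsort(-w, axis=1)[:, :k]
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    w = np.where(mask | mask.T, w, 0.0)        # keep an edge if either endpoint selects it
    return w / w.sum(axis=1, keepdims=True)

n = 200
img_feats = rng.normal(size=(n, 64))    # image modality (toy features)
txt_feats = rng.normal(size=(n, 32))    # text modality (toy features)

# A fused cross-modality similarity graph: here simply an average of the two
# per-modality graphs, standing in for the diffusion-based fusion in the paper.
S = 0.5 * knn_similarity(img_feats) + 0.5 * knn_similarity(txt_feats)

# Embed the fused graph into Hamming space by signing the top eigenvectors
# (spectral relaxation; FSH instead optimizes the binary codes directly).
bits = 16
vals, vecs = np.linalg.eigh((S + S.T) / 2)
codes = (vecs[:, -bits:] > 0).astype(np.uint8)
hamming = (codes[:, None, :] != codes[None, :, :]).sum(-1)
print(codes.shape, hamming.shape)
```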
{"title":"Cross-Modality Binary Code Learning via Fusion Similarity Hashing","authors":"Hong Liu, R. Ji, Yongjian Wu, Feiyue Huang, Baochang Zhang","doi":"10.1109/CVPR.2017.672","DOIUrl":"https://doi.org/10.1109/CVPR.2017.672","url":null,"abstract":"Binary code learning has been emerging topic in large-scale cross-modality retrieval recently. It aims to map features from multiple modalities into a common Hamming space, where the cross-modality similarity can be approximated efficiently via Hamming distance. To this end, most existing works learn binary codes directly from data instances in multiple modalities, which preserve both intra-and inter-modal similarities respectively. Few methods consider to preserve the fusion similarity among multi-modal instances instead, which can explicitly capture their heterogeneous correlation in cross-modality retrieval. In this paper, we propose a hashing scheme, termed Fusion Similarity Hashing (FSH), which explicitly embeds the graph-based fusion similarity across modalities into a common Hamming space. Inspired by the fusion by diffusion, our core idea is to construct an undirected asymmetric graph to model the fusion similarity among different modalities, upon which a graph hashing scheme with alternating optimization is introduced to learn binary codes that embeds such fusion similarity. Quantitative evaluations on three widely used benchmarks, i.e., UCI Handwritten Digit, MIR-Flickr25K and NUS-WIDE, demonstrate that the proposed FSH approach can achieve superior performance over the state-of-the-art methods.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"56 1","pages":"6345-6353"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89117114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 146
Deep Mixture of Linear Inverse Regressions Applied to Head-Pose Estimation
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.756
Stéphane Lathuilière, Rémi Juge, P. Mesejo, R. Muñoz-Salinas, R. Horaud
Convolutional Neural Networks (ConvNets) have become the state of the art for many classification and regression problems in computer vision. For regression, the output layer is often trained by measuring the Euclidean distance between targets and predictions. In this paper, we propose coupling a Gaussian mixture of linear inverse regressions with a ConvNet, and we describe the methodological foundations and the associated algorithm to jointly train the deep network and the regression function. We test our model on the head-pose estimation problem. On this problem, we show that inverse regression outperforms the regression models currently used by state-of-the-art computer vision methods. Our method does not require the incorporation of additional data, as is often proposed in the literature, and thus works well on relatively small training datasets. Finally, it outperforms state-of-the-art methods in head-pose estimation on a widely used head-pose dataset. To the best of our knowledge, we are the first to incorporate inverse regression into deep learning for computer vision applications.
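The core idea of inverse regression, fitting the high-dimensional features as a function of the low-dimensional target within each mixture component and then inverting the fitted map for prediction, can be sketched on toy data as follows. Component assignments are assumed known here, whereas the paper learns them jointly with the deep network; all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a low-dimensional target t (e.g. yaw/pitch) generates high-dimensional
# ConvNet-like features y through component-specific affine maps plus noise.
n, dt, dy, K = 600, 2, 50, 3
t = rng.uniform(-1, 1, size=(n, dt))
A_true = rng.normal(size=(K, dy, dt))
b_true = rng.normal(size=(K, dy))
z = rng.integers(0, K, size=n)                 # hidden mixture component per sample
y = np.einsum('kij,nj->nki', A_true, t)[np.arange(n), z] + b_true[z] \
    + 0.05 * rng.normal(size=(n, dy))

# "Inverse regression": fit y as a function of t within each (here known) component,
# then invert the fitted affine map to predict t from y.
preds = np.zeros_like(t)
for k in range(K):
    idx = z == k
    T = np.hstack([t[idx], np.ones((idx.sum(), 1))])
    W, *_ = np.linalg.lstsq(T, y[idx], rcond=None)      # y ≈ [t, 1] @ W
    A, b = W[:-1].T, W[-1]
    preds[idx] = (y[idx] - b) @ np.linalg.pinv(A).T     # t ≈ pinv(A) (y - b)
print(np.mean((preds - t) ** 2))   # small reconstruction error on this toy data
```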
{"title":"Deep Mixture of Linear Inverse Regressions Applied to Head-Pose Estimation","authors":"Stéphane Lathuilière, Rémi Juge, P. Mesejo, R. Muñoz-Salinas, R. Horaud","doi":"10.1109/CVPR.2017.756","DOIUrl":"https://doi.org/10.1109/CVPR.2017.756","url":null,"abstract":"Convolutional Neural Networks (ConvNets) have become the state-of-the-art for many classification and regression problems in computer vision. When it comes to regression, approaches such as measuring the Euclidean distance of target and predictions are often employed as output layer. In this paper, we propose the coupling of a Gaussian mixture of linear inverse regressions with a ConvNet, and we describe the methodological foundations and the associated algorithm to jointly train the deep network and the regression function. We test our model on the head-pose estimation problem. In this particular problem, we show that inverse regression outperforms regression models currently used by state-of-the-art computer vision methods. Our method does not require the incorporation of additional data, as it is often proposed in the literature, thus it is able to work well on relatively small training datasets. Finally, it outperforms state-of-the-art methods in head-pose estimation using a widely used head-pose dataset. To the best of our knowledge, we are the first to incorporate inverse regression into deep learning for computer vision applications.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"30 1","pages":"7149-7157"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90350020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
Zero-Shot Action Recognition with Error-Correcting Output Codes
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.117
Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Jiaxin Chen, Yunhong Wang
Recently, zero-shot action recognition (ZSAR) has emerged with the explosive growth of action categories. In this paper, we explore ZSAR from a novel perspective by adopting the Error-Correcting Output Codes (dubbed ZSECOC). Our ZSECOC equips the conventional ECOC with the additional capability of ZSAR, by addressing the domain shift problem. In particular, we learn discriminative ZSECOC for seen categories from both category-level semantics and intrinsic data structures. This procedure deals with domain shift implicitly by transferring the well-established correlations among seen categories to unseen ones. Moreover, a simple semantic transfer strategy is developed for explicitly transforming the learned embeddings of seen categories to better fit the underlying structure of unseen categories. As a consequence, our ZSECOC inherits the promising characteristics from ECOC as well as overcomes domain shift, making it more discriminative for ZSAR. We systematically evaluate ZSECOC on three realistic action benchmarks, i.e. Olympic Sports, HMDB51 and UCF101. The experimental results clearly show the superiority of ZSECOC over the state-of-the-art methods.
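The ECOC decoding step, assigning a video to the class whose codeword is nearest in Hamming distance to the predicted code, can be sketched as follows. The random codewords and bit classifiers are placeholders for the semantics-driven codes that ZSECOC actually learns from category-level semantics and data structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ECOC-style zero-shot decoding: each action class (seen and unseen) gets a
# binary codeword; a video's predicted code is matched to the nearest codeword.
n_classes, code_len = 12, 32
codewords = rng.integers(0, 2, size=(n_classes, code_len))   # one row per class
unseen = np.array([9, 10, 11])                               # classes never trained on

def predict_code(video_feat, classifiers):
    """Apply one binary classifier per code bit (here: random linear stumps)."""
    return (video_feat @ classifiers > 0).astype(int)

def decode(code, allowed_classes):
    """Assign the class whose codeword has the smallest Hamming distance."""
    dists = (codewords[allowed_classes] != code).sum(axis=1)
    return allowed_classes[int(np.argmin(dists))]

classifiers = rng.normal(size=(128, code_len))     # hypothetical bit classifiers
video = rng.normal(size=128)                       # hypothetical video feature
code = predict_code(video, classifiers)
print(decode(code, unseen))                        # zero-shot: search only unseen classes
```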
{"title":"Zero-Shot Action Recognition with Error-Correcting Output Codes","authors":"Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Jiaxin Chen, Yunhong Wang","doi":"10.1109/CVPR.2017.117","DOIUrl":"https://doi.org/10.1109/CVPR.2017.117","url":null,"abstract":"Recently, zero-shot action recognition (ZSAR) has emerged with the explosive growth of action categories. In this paper, we explore ZSAR from a novel perspective by adopting the Error-Correcting Output Codes (dubbed ZSECOC). Our ZSECOC equips the conventional ECOC with the additional capability of ZSAR, by addressing the domain shift problem. In particular, we learn discriminative ZSECOC for seen categories from both category-level semantics and intrinsic data structures. This procedure deals with domain shift implicitly by transferring the well-established correlations among seen categories to unseen ones. Moreover, a simple semantic transfer strategy is developed for explicitly transforming the learned embeddings of seen categories to better fit the underlying structure of unseen categories. As a consequence, our ZSECOC inherits the promising characteristics from ECOC as well as overcomes domain shift, making it more discriminative for ZSAR. We systematically evaluate ZSECOC on three realistic action benchmarks, i.e. Olympic Sports, HMDB51 and UCF101. The experimental results clearly show the superiority of ZSECOC over the state-of-the-art methods.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"94 1","pages":"1042-1051"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73931084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 136
Hyperspectral Image Super-Resolution via Non-local Sparse Tensor Factorization
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.411
Renwei Dian, Leyuan Fang, Shutao Li
Hyperspectral image (HSI) super-resolution, which fuses a low-resolution (LR) HSI with a high-resolution (HR) multispectral image (MSI), has recently attracted much attention. Most current HSI super-resolution approaches are based on matrix factorization, which unfolds the three-dimensional HSI into a matrix before processing. In general, the matrix data representation obtained after the unfolding operation makes it hard to fully exploit the inherent HSI spatial-spectral structures. In this paper, a novel HSI super-resolution method based on non-local sparse tensor factorization (termed NLSTF) is proposed. The sparse tensor factorization directly decomposes each cube of the HSI into a sparse core tensor and dictionaries of three modes, which reformulates the HSI super-resolution problem as the estimation of the sparse core tensor and dictionaries for each cube. To further exploit the non-local spatial self-similarities of the HSI, similar cubes are grouped together and assumed to share the same dictionaries. The dictionaries are learned from the LR-HSI and HR-MSI for each group, and the corresponding sparse core tensors are estimated by sparse coding on the learned dictionaries for each cube. Experimental results demonstrate the superiority of the proposed NLSTF approach over several state-of-the-art HSI super-resolution approaches.
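A toy sketch of the cube decomposition is given below: a cube is modeled as a sparse core tensor multiplied by three mode dictionaries, and the core is recovered by least squares plus soft-thresholding. This is a crude stand-in for the sparse coding and dictionary learning in NLSTF; all sizes, the random dictionaries, and the threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mode_products(core, D1, D2, D3):
    """Reconstruct a cube from a core tensor and three mode dictionaries:
    X = core x1 D1 x2 D2 x3 D3 (Tucker-style n-mode products)."""
    return np.einsum('abc,ia,jb,kc->ijk', core, D1, D2, D3)

# Toy cube: an 8x8 spatial patch with 16 spectral bands; small mode dictionaries.
I, J, K = 8, 8, 16
r1, r2, r3 = 4, 4, 6
D1, D2, D3 = (rng.normal(size=(I, r1)), rng.normal(size=(J, r2)),
              rng.normal(size=(K, r3)))
true_core = rng.normal(size=(r1, r2, r3)) * (rng.random((r1, r2, r3)) < 0.2)  # sparse
cube = mode_products(true_core, D1, D2, D3) + 0.01 * rng.normal(size=(I, J, K))

# Estimate the sparse core by least squares followed by soft-thresholding
# (a simple surrogate for the sparse coding step, with fixed dictionaries).
M = np.einsum('ia,jb,kc->ijkabc', D1, D2, D3).reshape(I * J * K, r1 * r2 * r3)
core_ls, *_ = np.linalg.lstsq(M, cube.ravel(), rcond=None)
lam = 0.05
core_hat = np.sign(core_ls) * np.maximum(np.abs(core_ls) - lam, 0.0)
recon = mode_products(core_hat.reshape(r1, r2, r3), D1, D2, D3)
print(np.linalg.norm(recon - cube))   # small residual on this toy cube
```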
{"title":"Hyperspectral Image Super-Resolution via Non-local Sparse Tensor Factorization","authors":"Renwei Dian, Leyuan Fang, Shutao Li","doi":"10.1109/CVPR.2017.411","DOIUrl":"https://doi.org/10.1109/CVPR.2017.411","url":null,"abstract":"Hyperspectral image (HSI) super-resolution, which fuses a low-resolution (LR) HSI with a high-resolution (HR) multispectral image (MSI), has recently attracted much attention. Most of the current HSI super-resolution approaches are based on matrix factorization, which unfolds the three-dimensional HSI as a matrix before processing. In general, the matrix data representation obtained after the matrix unfolding operation makes it hard to fully exploit the inherent HSI spatial-spectral structures. In this paper, a novel HSI super-resolution method based on non-local sparse tensor factorization (called as the NLSTF) is proposed. The sparse tensor factorization can directly decompose each cube of the HSI as a sparse core tensor and dictionaries of three modes, which reformulates the HSI super-resolution problem as the estimation of sparse core tensor and dictionaries for each cube. To further exploit the non-local spatial self-similarities of the HSI, similar cubes are grouped together, and they are assumed to share the same dictionaries. The dictionaries are learned from the LR-HSI and HR-MSI for each group, and corresponding sparse core tensors are estimated by spare coding on the learned dictionaries for each cube. Experimental results demonstrate the superiority of the proposed NLSTF approach over several state-of-the-art HSI super-resolution approaches.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"248 1","pages":"3862-3871"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72955138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 199
Growing a Brain: Fine-Tuning by Increasing Model Capacity
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.323
Yu-Xiong Wang, Deva Ramanan, M. Hebert
CNNs have made an undeniable impact on computer vision through the ability to learn high-capacity models with large annotated training sets. One of their remarkable properties is the ability to transfer knowledge from a large source dataset to a (typically smaller) target dataset. This is usually accomplished through fine-tuning a fixed-size network on new target data. Indeed, virtually every contemporary visual recognition system makes use of fine-tuning to transfer knowledge from ImageNet. In this work, we analyze what components and parameters change during fine-tuning, and discover that increasing model capacity allows for more natural model adaptation through fine-tuning. By making an analogy to developmental learning, we demonstrate that growing a CNN with additional units, either by widening existing layers or deepening the overall network, significantly outperforms classic fine-tuning approaches. But in order to properly grow a network, we show that newly-added units must be appropriately normalized to allow for a pace of learning that is consistent with existing units. We empirically validate our approach on several benchmark datasets, producing state-of-the-art results.
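The point about normalizing newly added units can be illustrated with a small sketch that widens a fully connected layer while rescaling the new weights to the average norm of the existing ones. The layer sizes and the rescaling rule are assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def widen_layer(W, b, n_new):
    """Add n_new output units to a fully connected layer W (in_dim x out_dim).
    New weights are rescaled so their norm matches the average norm of the
    existing units, so the added units learn at a pace consistent with the
    pretrained ones (a simple interpretation of the paper's normalization)."""
    in_dim, out_dim = W.shape
    target_norm = np.linalg.norm(W, axis=0).mean()
    W_new = rng.normal(size=(in_dim, n_new))
    W_new *= target_norm / (np.linalg.norm(W_new, axis=0) + 1e-8)
    b_new = np.zeros(n_new)
    return np.hstack([W, W_new]), np.concatenate([b, b_new])

# Hypothetical pretrained layer: 512 -> 1024 units, widened by 256 new units.
W = rng.normal(scale=0.05, size=(512, 1024))
b = np.zeros(1024)
W_wide, b_wide = widen_layer(W, b, 256)
print(W_wide.shape, b_wide.shape)   # (512, 1280) (1280,)
```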
{"title":"Growing a Brain: Fine-Tuning by Increasing Model Capacity","authors":"Yu-Xiong Wang, Deva Ramanan, M. Hebert","doi":"10.1109/CVPR.2017.323","DOIUrl":"https://doi.org/10.1109/CVPR.2017.323","url":null,"abstract":"CNNs have made an undeniable impact on computer vision through the ability to learn high-capacity models with large annotated training sets. One of their remarkable properties is the ability to transfer knowledge from a large source dataset to a (typically smaller) target dataset. This is usually accomplished through fine-tuning a fixed-size network on new target data. Indeed, virtually every contemporary visual recognition system makes use of fine-tuning to transfer knowledge from ImageNet. In this work, we analyze what components and parameters change during fine-tuning, and discover that increasing model capacity allows for more natural model adaptation through fine-tuning. By making an analogy to developmental learning, we demonstrate that growing a CNN with additional units, either by widening existing layers or deepening the overall network, significantly outperforms classic fine-tuning approaches. But in order to properly grow a network, we show that newly-added units must be appropriately normalized to allow for a pace of learning that is consistent with existing units. We empirically validate our approach on several benchmark datasets, producing state-of-the-art results.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"60 1","pages":"3029-3038"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79045487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 132
Subspace Clustering via Variance Regularized Ridge Regression
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.80
Chong Peng, Zhao Kang, Q. Cheng
Spectral clustering based subspace clustering methods have emerged recently. When the inputs are 2-dimensional (2D) data, most existing clustering methods convert such data to vectors as preprocessing, which severely damages the spatial information of the data. In this paper, we propose a novel subspace clustering method for 2D data with an enhanced capability of retaining spatial information for clustering. It seeks two projection matrices and simultaneously constructs a linear representation of the projected data, such that the sought projections help construct the most expressive representation with the most variational information. We regularize our method based on covariance matrices directly obtained from the 2D data, which are much smaller and more computationally amenable. Moreover, to exploit nonlinear structures of the data, a nonlinear version is proposed, which constructs an adaptive manifold according to the updated projections. The learning processes of projections, representation, and manifold thus mutually enhance each other, leading to a powerful data representation. Efficient optimization procedures are proposed, which generate a non-increasing objective value sequence with a theoretical convergence guarantee. Extensive experimental results confirm the effectiveness of the proposed method.
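A minimal sketch of the ridge-regression self-representation followed by spectral clustering is shown below. It omits the learned 2D projections and the variance regularizer that are the paper's contributions, and uses plain vector data for simplicity; the regularization weight and the two-cluster split are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: points drawn from two independent low-dimensional subspaces, as columns.
basis1, basis2 = rng.normal(size=(30, 3)), rng.normal(size=(30, 3))
X = np.hstack([basis1 @ rng.normal(size=(3, 40)), basis2 @ rng.normal(size=(3, 40))])

# Self-expressive ridge regression: represent each point by the others,
# C = argmin ||X - XC||_F^2 + lam ||C||_F^2 (closed form); the diagonal is
# zeroed afterwards as a rough stand-in for the usual constraint.
lam = 0.1
n = X.shape[1]
C = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ X)
np.fill_diagonal(C, 0.0)

# Spectral clustering on the symmetrized affinity |C| + |C|^T via the graph
# Laplacian; a median split of the second eigenvector gives two clusters.
A = np.abs(C) + np.abs(C).T
L = np.diag(A.sum(axis=1)) - A
vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]
labels = (fiedler > np.median(fiedler)).astype(int)
print(labels[:40].mean(), labels[40:].mean())  # should roughly separate the two subspaces
```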
{"title":"Subspace Clustering via Variance Regularized Ridge Regression","authors":"Chong Peng, Zhao Kang, Q. Cheng","doi":"10.1109/CVPR.2017.80","DOIUrl":"https://doi.org/10.1109/CVPR.2017.80","url":null,"abstract":"Spectral clustering based subspace clustering methods have emerged recently. When the inputs are 2-dimensional (2D) data, most existing clustering methods convert such data to vectors as preprocessing, which severely damages spatial information of the data. In this paper, we propose a novel subspace clustering method for 2D data with enhanced capability of retaining spatial information for clustering. It seeks two projection matrices and simultaneously constructs a linear representation of the projected data, such that the sought projections help construct the most expressive representation with the most variational information. We regularize our method based on covariance matrices directly obtained from 2D data, which have much smaller size and are more computationally amiable. Moreover, to exploit nonlinear structures of the data, a nonlinear version is proposed, which constructs an adaptive manifold according to updated projections. The learning processes of projections, representation, and manifold thus mutually enhance each other, leading to a powerful data representation. Efficient optimization procedures are proposed, which generate non-increasing objective value sequence with theoretical convergence guarantee. Extensive experimental results confirm the effectiveness of proposed method.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"29 11 1","pages":"682-691"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78171660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51
Generative Hierarchical Learning of Sparse FRAME Models
Pub Date : 2017-07-21 DOI: 10.1109/CVPR.2017.209
Jianwen Xie, Yifei Xu, Erik Nijkamp, Y. Wu, Song-Chun Zhu
This paper proposes a method for generative learning of hierarchical random field models. The resulting model, which we call the hierarchical sparse FRAME (Filters, Random field, And Maximum Entropy) model, generalizes the original sparse FRAME model by decomposing it into multiple parts that are allowed to shift their locations, scales and rotations, so that the resulting model becomes a hierarchical deformable template. The model can be trained by an EM-type algorithm that alternates two steps: (1) Inference: given the current model, we match it to each training image by inferring the unknown locations, scales, and rotations of the object and its parts with recursive sum-max maps; and (2) Re-learning: given the inferred geometric configurations of the objects and their parts, we re-learn the model parameters by maximum likelihood estimation via a stochastic gradient algorithm. Experiments show that the proposed method is capable of learning meaningful and interpretable templates that can be used for object detection, classification and clustering.
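The inference step with sum-max maps can be sketched as follows: each part's response map is max-filtered over a small shift range, shifted to its anchor, and summed into an object score map whose maximum gives the detected location. The part count, anchors and shift radius are assumed values, and the wrap-around boundary handling is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_max(score_map, radius):
    """Max-filter a part score map so each part may shift within +/- radius
    (the 'max' step of the sum-max maps used for inference)."""
    out = np.full_like(score_map, -np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(np.roll(score_map, dy, axis=0), dx, axis=1)
            out = np.maximum(out, shifted)
    return out

# Toy object template with three parts at fixed anchor offsets (assumed values).
H, W = 40, 40
part_maps = [rng.normal(size=(H, W)) for _ in range(3)]   # per-part filter responses
anchors = [(0, 0), (5, -3), (-4, 6)]                       # part offsets from the center

# 'Sum' step: the object score at each location is the sum of its parts' best
# locally shifted responses placed at their anchors.
object_score = np.zeros((H, W))
for pm, (ay, ax) in zip(part_maps, anchors):
    object_score += np.roll(np.roll(local_max(pm, radius=2), ay, axis=0), ax, axis=1)

cy, cx = np.unravel_index(np.argmax(object_score), object_score.shape)
print("best object location:", (cy, cx))
```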
{"title":"Generative Hierarchical Learning of Sparse FRAME Models","authors":"Jianwen Xie, Yifei Xu, Erik Nijkamp, Y. Wu, Song-Chun Zhu","doi":"10.1109/CVPR.2017.209","DOIUrl":"https://doi.org/10.1109/CVPR.2017.209","url":null,"abstract":"This paper proposes a method for generative learning of hierarchical random field models. The resulting model, which we call the hierarchical sparse FRAME (Filters, Random field, And Maximum Entropy) model, is a generalization of the original sparse FRAME model by decomposing it into multiple parts that are allowed to shift their locations, scales and rotations, so that the resulting model becomes a hierarchical deformable template. The model can be trained by an EM-type algorithm that alternates the following two steps: (1) Inference: Given the current model, we match it to each training image by inferring the unknown locations, scales, and rotations of the object and its parts by recursive sum-max maps, and (2) Re-learning: Given the inferred geometric configurations of the objects and their parts, we re-learn the model parameters by maximum likelihood estimation via stochastic gradient algorithm. Experiments show that the proposed method is capable of learning meaningful and interpretable templates that can be used for object detection, classification and clustering.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"18 1","pages":"1933-1941"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82400686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3