
Latest Publications: 2017 IEEE International Conference on Computer Vision (ICCV)

Multi-label Image Recognition by Recurrently Discovering Attentional Regions
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.58
Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, Liang Lin
This paper proposes a novel deep architecture to address multi-label image recognition, a fundamental and practical task towards general visual understanding. Current solutions for this task usually rely on an extra step of extracting hypothesis regions (i.e., region proposals), resulting in redundant computation and sub-optimal performance. In this work, we achieve interpretable and contextualized multi-label image classification by developing a recurrent memorized-attention module. This module consists of two alternately performed components: i) a spatial transformer layer to locate attentional regions from the convolutional feature maps in a region-proposal-free way and ii) an LSTM (Long Short-Term Memory) sub-network to sequentially predict semantic labeling scores on the located regions while capturing the global dependencies of these regions. The LSTM also outputs the parameters for computing the spatial transformer. On large-scale benchmarks of multi-label image classification (e.g., MS-COCO and PASCAL VOC 07), our approach demonstrates superior performance over existing state-of-the-art methods in both accuracy and efficiency.
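As a rough illustration of the alternation the abstract describes, the PyTorch-style sketch below crops an attentional region with a spatial transformer (affine_grid/grid_sample), feeds it to an LSTM cell, and lets the LSTM emit both label scores and the parameters of the next transform. All shapes, the identity initialization of the transform, and the max-pooling over steps are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentAttention(nn.Module):
    """Minimal sketch of a recurrent memorized-attention loop: an LSTM cell
    alternates with a spatial-transformer crop of the conv feature map."""
    def __init__(self, feat_channels=512, hidden=512, num_labels=80, steps=5, crop=7):
        super().__init__()
        self.steps, self.crop = steps, crop
        self.lstm = nn.LSTMCell(feat_channels * crop * crop, hidden)
        self.to_theta = nn.Linear(hidden, 6)      # affine params for the next region
        self.to_scores = nn.Linear(hidden, num_labels)

    def forward(self, feat):                      # feat: (B, C, H, W) conv feature map
        B = feat.size(0)
        h = feat.new_zeros(B, self.lstm.hidden_size)
        c = feat.new_zeros(B, self.lstm.hidden_size)
        # start from an identity transform (the whole feature map)
        theta = feat.new_tensor([[1., 0., 0., 0., 1., 0.]]).repeat(B, 1)
        step_scores = []
        for _ in range(self.steps):
            grid = F.affine_grid(theta.view(B, 2, 3),
                                 (B, feat.size(1), self.crop, self.crop),
                                 align_corners=False)
            region = F.grid_sample(feat, grid, align_corners=False)  # attended region
            h, c = self.lstm(region.flatten(1), (h, c))
            step_scores.append(self.to_scores(h))  # per-step label scores
            theta = self.to_theta(h)               # LSTM emits the next transform
        # one simple aggregation: category-wise max pooling over the steps
        return torch.stack(step_scores, dim=1).max(dim=1).values
```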
Citations: 238
FLaME: Fast Lightweight Mesh Estimation Using Variational Smoothing on Delaunay Graphs
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.502
W. N. Greene, N. Roy
We propose a lightweight method for dense online monocular depth estimation capable of reconstructing 3D meshes on computationally constrained platforms. Our main contribution is to pose the reconstruction problem as a non-local variational optimization over a time-varying Delaunay graph of the scene geometry, which allows for an efficient, keyframeless approach to depth estimation. The graph can be tuned to favor reconstruction quality or speed and is continuously smoothed and augmented as the camera explores the scene. Unlike keyframe-based approaches, the optimized surface is always available at the current pose, which is necessary for low-latency obstacle avoidance. FLaME (Fast Lightweight Mesh Estimation) can generate mesh reconstructions at upwards of 230 Hz using less than one Intel i7 CPU core, which enables operation on size, weight, and power-constrained platforms. We present results from both benchmark datasets and experiments running FLaME in-the-loop onboard a small flying quadrotor.
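The Delaunay-graph representation can be pictured with the sketch below, which triangulates sparse feature pixels with SciPy and runs a naive neighbor-averaging smoothing of inverse depth over the graph edges. FLaME instead minimizes a non-local variational objective over this graph, so this is only an illustration of the data structure; the inputs, weights, and iteration scheme are made-up assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_smooth_inverse_depth(pixels, inv_depth, iters=50, lam=0.5):
    """Illustrative only: build a Delaunay graph over sparse feature pixels and
    run a simple weighted-averaging smoothing of inverse depth on its edges."""
    tri = Delaunay(pixels)                      # pixels: (N, 2) image coordinates
    # collect the undirected edges of the triangulation
    edges = set()
    for a, b, c in tri.simplices:
        edges.update({tuple(sorted(e)) for e in [(a, b), (b, c), (a, c)]})
    edges = np.array(list(edges))
    data = inv_depth.astype(float)              # (N,) inverse depths from tracking
    z = data.copy()
    for _ in range(iters):
        acc = np.zeros_like(z)
        cnt = np.zeros_like(z)
        np.add.at(acc, edges[:, 0], z[edges[:, 1]])
        np.add.at(acc, edges[:, 1], z[edges[:, 0]])
        np.add.at(cnt, edges[:, 0], 1.0)
        np.add.at(cnt, edges[:, 1], 1.0)
        neighbor_mean = acc / np.maximum(cnt, 1.0)
        # balance smoothness against the per-vertex depth measurements
        z = (1 - lam) * data + lam * neighbor_mean
    return tri, z
```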
Citations: 21
Higher-Order Integration of Hierarchical Convolutional Activations for Fine-Grained Visual Categorization
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.63
Sijia Cai, W. Zuo, Lei Zhang
The success of fine-grained visual categorization (FGVC) relies heavily on modeling the appearance and interactions of various semantic parts. This makes FGVC very challenging because: (i) part annotation and detection require expert guidance and are very expensive; (ii) parts are of different sizes; and (iii) the part interactions are complex and of higher order. To address these issues, we propose an end-to-end framework based on higher-order integration of hierarchical convolutional activations for FGVC. By treating the convolutional activations as local descriptors, hierarchical convolutional activations can serve as a representation of local parts at different scales. A polynomial-kernel-based predictor is proposed to capture higher-order statistics of convolutional activations for modeling part interactions. To model inter-layer part interactions, we extend the polynomial predictor to integrate hierarchical activations via kernel fusion. Our work also provides a new perspective on combining convolutional activations from multiple layers. While hypercolumns simply concatenate maps from different layers, and holistically-nested networks use weighted fusion to combine side-outputs, our approach exploits higher-order intra-layer and inter-layer relations for better integration of hierarchical convolutional features. The proposed framework yields more discriminative representations and achieves competitive results on widely used FGVC datasets.
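A toy NumPy view of the polynomial-kernel idea: pooling degree-r outer products of local convolutional descriptors yields r-th-order statistics (degree 2 reduces to bilinear pooling). The paper learns this end-to-end with kernel fusion across layers; the sketch below only materializes the statistics, with assumed shapes and names.

```python
import numpy as np

def polynomial_pooling(acts, degree=2):
    """Toy illustration: average the degree-r outer products of local
    descriptors to obtain r-th order statistics of the activations."""
    # acts: (L, D) local descriptors from one conv layer (L spatial locations)
    L, D = acts.shape
    if degree == 1:
        return acts.mean(axis=0)
    if degree == 2:
        return (acts.T @ acts / L).ravel()        # (D*D,) second-order statistics
    # general case: average of r-fold outer products (tractable for small D only)
    feats = np.ones((L, 1))
    for _ in range(degree):
        feats = np.einsum('li,lj->lij', feats, acts).reshape(L, -1)
    return feats.mean(axis=0)
```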
Citations: 163
Monocular Free-Head 3D Gaze Tracking with Deep Learning and Geometry Constraints
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.341
Haoping Deng, Wangjiang Zhu
Free-head 3D gaze tracking outputs both the eye location and the gaze vector in 3D space, and it has wide applications in scenarios such as driver monitoring, advertisement analysis and surveillance. A reliable and low-cost monocular solution is critical for pervasive usage in these areas. Noticing that a gaze vector is a composition of head pose and eyeball movement in a geometrically deterministic way, we propose a novel gaze transform layer to connect separate head pose and eyeball movement models. The proposed decomposition does not suffer from head-gaze correlation overfitting and makes it possible to reuse existing datasets collected for other tasks. To add stronger supervision for better network training, we propose a two-step training strategy, which first trains sub-tasks with rough labels and then jointly trains with accurate gaze labels. To enable good cross-subject performance under various conditions, we collect a large dataset that has full coverage of head poses and eyeball movements, contains 200 subjects, and has diverse illumination conditions. Our deep solution achieves state-of-the-art gaze tracking accuracy, reaching 5.6° cross-subject prediction error using a small network running at 1000 fps on a single CPU (excluding face alignment time) and 4.3° cross-subject error with a deeper network.
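The geometrically deterministic composition behind a "gaze transform layer" can be written in a few lines: rotate an eyeball-frame gaze direction by the head pose to obtain the camera-frame gaze vector. The angle conventions, axes, and reference direction below are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def gaze_transform(head_yaw_pitch, eye_yaw_pitch):
    """Sketch: camera-frame gaze = head rotation applied to the eyeball's
    gaze direction expressed in the head frame."""
    def rot(yaw, pitch):
        Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                       [0, 1, 0],
                       [-np.sin(yaw), 0, np.cos(yaw)]])
        Rx = np.array([[1, 0, 0],
                       [0, np.cos(pitch), -np.sin(pitch)],
                       [0, np.sin(pitch), np.cos(pitch)]])
        return Ry @ Rx
    head_R = rot(*head_yaw_pitch)                       # head pose from one sub-model
    eye_dir = rot(*eye_yaw_pitch) @ np.array([0.0, 0.0, -1.0])  # eyeball sub-model
    gaze_cam = head_R @ eye_dir                         # deterministic composition
    return gaze_cam / np.linalg.norm(gaze_cam)
```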
Citations: 107
Composite Focus Measure for High Quality Depth Maps
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.179
P. Sakurikar, P J Narayanan
Depth from focus is a highly accessible method to estimate the 3D structure of everyday scenes. Today’s DSLR and mobile cameras facilitate the easy capture of multiple focused images of a scene. Focus measures (FMs) that estimate the amount of focus at each pixel form the basis of depth-from-focus methods. Several FMs have been proposed in the past and new ones will emerge in the future, each with their own strengths. We estimate a weighted combination of standard FMs that outperforms others on a wide range of scene types. The resulting composite focus measure consists of FMs that are in consensus with one another but not in chorus. Our two-stage pipeline first estimates fine depth at each pixel using the composite focus measure. A cost-volume propagation step then assigns depths from confident pixels to others. We can generate high quality depth maps using just the top five FMs from our composite focus measure. This is a positive step towards depth estimation of everyday scenes with no special equipment.
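A minimal sketch of the idea, assuming a registered focal stack of grayscale images: compute a few classic focus measures per pixel, combine them with (here, hand-picked) weights, and take the per-pixel argmax over the stack as a coarse depth index. The paper instead estimates the weights from data over a much larger bank of FMs and refines the result with cost-volume propagation.

```python
import numpy as np
from scipy import ndimage

def composite_focus_depth(focal_stack, weights=(0.4, 0.35, 0.25)):
    """Hedged sketch: weighted combination of three standard focus measures,
    followed by a per-pixel argmax over the focal stack."""
    def sml(img):                                   # sum-modified Laplacian
        lx = np.abs(ndimage.convolve1d(img, [-1, 2, -1], axis=1))
        ly = np.abs(ndimage.convolve1d(img, [-1, 2, -1], axis=0))
        return ndimage.uniform_filter(lx + ly, size=5)
    def tenengrad(img):                             # Sobel gradient energy
        gx, gy = ndimage.sobel(img, axis=1), ndimage.sobel(img, axis=0)
        return ndimage.uniform_filter(gx**2 + gy**2, size=5)
    def local_var(img):                             # local gray-level variance
        mu = ndimage.uniform_filter(img, size=5)
        return ndimage.uniform_filter(img**2, size=5) - mu**2
    fms = [sml, tenengrad, local_var]
    # focus volume: one composite score per focal slice, shape (num_slices, H, W)
    volume = np.stack([sum(w * f(s) for w, f in zip(weights, fms)) for s in focal_stack])
    return np.argmax(volume, axis=0)                # per-pixel best-focus index
```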
Citations: 21
Shadow Detection with Conditional Generative Adversarial Networks
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.483
Vu Nguyen, Tomas F. Yago Vicente, Maozheng Zhao, Minh Hoai, D. Samaras
We introduce scGAN, a novel extension of conditional Generative Adversarial Networks (GAN) tailored for the challenging problem of shadow detection in images. Previous methods for shadow detection focus on learning the local appearance of shadow regions, while using limited local context reasoning in the form of pairwise potentials in a Conditional Random Field. In contrast, the proposed adversarial approach is able to model higher level relationships and global scene characteristics. We train a shadow detector that corresponds to the generator of a conditional GAN, and augment its shadow accuracy by combining the typical GAN loss with a data loss term. Due to the unbalanced distribution of the shadow labels, we use weighted cross entropy. With the standard GAN architecture, properly setting the weight for the cross entropy would require training multiple GANs, a computationally expensive grid procedure. In scGAN, we introduce an additional sensitivity parameter w to the generator. The proposed approach effectively parameterizes the loss of the trained detector. The resulting shadow detector is a single network that can generate shadow maps corresponding to different sensitivity levels, obviating the need for multiple models and a costly training procedure. We evaluate our method on the large-scale SBU and UCF shadow datasets, and observe up to 17% error reduction with respect to the previous state-of-the-art method.
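The weighted cross entropy mentioned above can be sketched as follows, with `w` up-weighting the (typically sparse) shadow pixels. This is only one common form of class-weighted binary cross entropy written in PyTorch for illustration; how exactly w conditions the scGAN generator as a sensitivity input is not reproduced here.

```python
import torch

def sensitivity_weighted_bce(pred, target, w):
    """Class-weighted binary cross entropy: w weights shadow pixels,
    (1 - w) weights non-shadow pixels."""
    eps = 1e-7
    pred = pred.clamp(eps, 1 - eps)            # predicted shadow probabilities
    loss = -(w * target * torch.log(pred)
             + (1 - w) * (1 - target) * torch.log(1 - pred))
    return loss.mean()
```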
Citations: 156
Image Super-Resolution Using Dense Skip Connections
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.514
T. Tong, Gen Li, Xiejie Liu, Qinquan Gao
Recent studies have shown that the performance of single-image super-resolution methods can be significantly boosted by using deep convolutional neural networks. In this study, we present a novel single-image super-resolution method by introducing dense skip connections in a very deep network. In the proposed network, the feature maps of each layer are propagated into all subsequent layers, providing an effective way to combine the low-level features and high-level features to boost the reconstruction performance. In addition, the dense skip connections in the network enable short paths to be built directly from the output to each layer, alleviating the vanishing-gradient problem of very deep networks. Moreover, deconvolution layers are integrated into the network to learn the upsampling filters and to speedup the reconstruction process. Further, the proposed method substantially reduces the number of parameters, enhancing the computational efficiency. We evaluate the proposed method using images from four benchmark datasets and set a new state of the art.
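A heavily shrunk PyTorch sketch of the two ingredients named in the abstract: dense skip connections (each layer consumes the concatenation of all earlier feature maps) and a learned deconvolution upsampler. Channel counts, depth, and the single-channel input are illustrative assumptions; this is not the paper's full network.

```python
import torch
import torch.nn as nn

class TinyDenseSR(nn.Module):
    """Minimal dense-skip super-resolution sketch with a transposed-conv upsampler."""
    def __init__(self, growth=16, layers=4, scale=2):
        super().__init__()
        self.head = nn.Conv2d(1, growth, 3, padding=1)
        self.blocks = nn.ModuleList(
            [nn.Conv2d(growth * (i + 1), growth, 3, padding=1) for i in range(layers)])
        self.up = nn.ConvTranspose2d(growth * (layers + 1), growth,
                                     kernel_size=scale * 2, stride=scale,
                                     padding=scale // 2)
        self.tail = nn.Conv2d(growth, 1, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = [self.act(self.head(x))]
        for conv in self.blocks:
            # dense skip: each layer sees the concatenation of all earlier maps
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        return self.tail(self.act(self.up(torch.cat(feats, dim=1))))
```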
Citations: 989
High Order Tensor Formulation for Convolutional Sparse Coding
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.197
Adel Bibi, Bernard Ghanem
Convolutional sparse coding (CSC) has gained attention for its successful role as a reconstruction and classification tool in the computer vision and machine learning community. Current CSC methods can only reconstruct single-feature 2D images independently. However, learning multi-dimensional dictionaries and sparse codes for the reconstruction of multi-dimensional data is very important, as it jointly examines correlations among all the data. This provides more capacity for the learned dictionaries to better reconstruct data. In this paper, we propose a generic and novel formulation of the CSC problem that can handle a data tensor of arbitrary order. Backed by experimental results, our proposed formulation can not only tackle applications that are not possible with standard CSC solvers, including colored video reconstruction (5D tensors), but it also performs favorably in reconstruction with far fewer parameters compared to naive extensions of standard CSC to multiple features/channels.
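For reference, the standard single-feature 2D CSC objective that the paper generalizes to higher-order tensors can be evaluated as below (solvers such as Fourier-domain ADMM are omitted); the filter and code names are placeholders.

```python
import numpy as np
from scipy.signal import fftconvolve

def csc_objective(image, filters, codes, lam=0.1):
    """Reference evaluation of the standard 2D CSC objective:
    0.5 * ||x - sum_k d_k * z_k||^2 + lam * sum_k ||z_k||_1."""
    recon = np.zeros_like(image, dtype=float)
    for d_k, z_k in zip(filters, codes):         # filters: small 2D kernels
        recon += fftconvolve(z_k, d_k, mode='same')  # z_k: image-sized sparse map
    data_term = 0.5 * np.sum((image - recon) ** 2)
    sparsity = lam * sum(np.abs(z_k).sum() for z_k in codes)
    return data_term + sparsity
```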
Citations: 22
TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.619
Hongyuan Zhu, Romain Vial, Shijian Lu
Given a video clip, action proposal aims to quickly generate a number of spatio-temporal tubes that enclose candidate human activities. Recently, regression-based networks and the long-term recurrent convolutional network (L-RCN) have demonstrated superior performance in object detection and action recognition. However, regression-based detectors perform inference without considering the temporal context among neighboring frames, and the L-RCN, which uses global visual percepts, lacks the capability to capture local temporal dynamics. In this paper, we present a novel framework called TORNADO for human action proposal detection in untrimmed video clips. Specifically, we propose a spatio-temporal convolutional network that combines the advantages of regression-based detectors and the L-RCN by empowering a Convolutional LSTM with regression capability. Our approach consists of a temporal convolutional regression network (T-CRN) and a spatial regression network (S-CRN), which are trained end-to-end on both RGB and optical-flow streams. They fuse appearance, motion and temporal contexts to regress the bounding boxes of candidate human actions simultaneously at 28 FPS. The action proposals are constructed by dynamic programming with peak trimming of the generated action boxes. Extensive experiments on the challenging UCF-101 and UCF-Sports datasets show that our method achieves superior performance compared with the state of the art.
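One way to picture "a Convolutional LSTM with regression capability" is the sketch below: a ConvLSTM cell carries temporal context across per-frame features while a 1x1 convolution regresses per-cell box offsets plus an actionness score. Anchor design, the two-stream fusion, and the dynamic-programming linking are omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMBoxRegressor(nn.Module):
    """ConvLSTM cell with a dense box-regression head (illustrative sizes)."""
    def __init__(self, in_ch=256, hid_ch=128):
        super().__init__()
        self.hid_ch = hid_ch
        # one conv produces the input/forget/output/candidate gates jointly
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)
        self.head = nn.Conv2d(hid_ch, 5, 1)      # 4 box offsets + 1 score per cell

    def forward(self, frames):                   # frames: (T, B, C, H, W) features
        T, B, _, H, W = frames.shape
        h = frames.new_zeros(B, self.hid_ch, H, W)
        c = frames.new_zeros(B, self.hid_ch, H, W)
        outputs = []
        for t in range(T):
            i, f, o, g = self.gates(torch.cat([frames[t], h], dim=1)).chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(self.head(h))         # per-frame dense box regression
        return torch.stack(outputs)              # (T, B, 5, H, W)
```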
Citations: 56
Space-Time Localization and Mapping
Pub Date : 2017-10-01 DOI: 10.1109/ICCV.2017.422
Minhaeng Lee, Charless C. Fowlkes
This paper addresses the problem of building a spatiotemporal model of the world from a stream of time-stamped data. Unlike traditional models for simultaneous localization and mapping (SLAM) and structure-from-motion (SfM) which focus on recovering a single rigid 3D model, we tackle the problem of mapping scenes in which dynamic components appear, move and disappear independently of each other over time. We introduce a simple generative probabilistic model of 4D structure which specifies location, spatial and temporal extent of rigid surface patches by local Gaussian mixtures. We fit this model to a time-stamped stream of input data using expectation-maximization to estimate the model structure parameters (mapping) and the alignment of the input data to the model (localization). By explicitly representing the temporal extent and observability of surfaces in a scene, our method yields superior localization and reconstruction relative to baselines that assume a static 3D scene. We carry out experiments on both synthetic RGB-D data streams as well as challenging real-world datasets, tracking scene dynamics in a human workspace over the course of several weeks.
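A loose illustration of the generative model, assuming already-aligned time-stamped points: fitting a Gaussian mixture in (x, y, z, t) gives each component both a spatial and a temporal extent. The paper alternates such a mapping step with localization inside a full EM loop; the scikit-learn call below covers only the mixture fit, with made-up function and variable names.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_spacetime_patches(points_xyzt, n_patches=20):
    """Fit local Gaussians in (x, y, z, t); the time-axis variance of each
    component serves as a crude proxy for its temporal extent."""
    gmm = GaussianMixture(n_components=n_patches, covariance_type='full')
    gmm.fit(points_xyzt)                              # points_xyzt: (N, 4) array
    temporal_extent = np.sqrt(gmm.covariances_[:, 3, 3])  # std-dev along time
    return gmm.means_, temporal_extent
```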
Citations: 4