
Latest publications from 2017 IEEE International Conference on Computer Vision (ICCV)

Shape Inpainting Using 3D Generative Adversarial Network and Recurrent Convolutional Networks
Pub Date: 2017-11-17 | DOI: 10.1109/ICCV.2017.252
Weiyue Wang, Qiangui Huang, Suya You, Chao Yang, U. Neumann
Recent advances in convolutional neural networks have shown promising results in 3D shape completion. But due to GPU memory limitations, these methods can only produce low-resolution outputs. To inpaint 3D models with semantic plausibility and contextual details, we introduce a hybrid framework that combines a 3D Encoder-Decoder Generative Adversarial Network (3D-ED-GAN) and a Long-term Recurrent Convolutional Network (LRCN). The 3D-ED-GAN is a 3D convolutional neural network trained with a generative adversarial paradigm to fill in missing 3D data at low resolution. The LRCN adopts a recurrent neural network architecture to minimize GPU memory usage and incorporates an Encoder-Decoder pair into a Long Short-Term Memory network. By handling the 3D model as a sequence of 2D slices, the LRCN transforms a coarse 3D shape into a more complete and higher-resolution volume. While the 3D-ED-GAN captures the global contextual structure of the 3D shape, the LRCN localizes the fine-grained details. Experimental results on both real-world and synthetic data show that reconstructions from corrupted models result in complete, high-resolution 3D objects.
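The slice-as-sequence idea behind the LRCN can be illustrated with a short sketch. The following is a minimal PyTorch-style example, not the authors' implementation: a hypothetical `SliceRefiner` module encodes each 2D slice of a coarse voxel volume, runs the per-slice codes through an LSTM so neighbouring slices inform each other, and decodes each hidden state into a higher-resolution slice. All layer sizes are arbitrary choices for illustration.

```python
# Minimal sketch of refining a 3D volume as a sequence of 2D slices
# (simplified, assumed architecture; not the paper's 3D-ED-GAN/LRCN code).
import torch
import torch.nn as nn

class SliceRefiner(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # Per-slice 2D encoder: 32x32 slice -> feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, hidden),
        )
        # Recurrence over the slice axis ties neighbouring slices together.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Per-slice decoder: hidden state -> 64x64 refined slice.
        self.decoder = nn.Sequential(nn.Linear(hidden, 64 * 64), nn.Sigmoid())

    def forward(self, volume):                      # volume: (B, D, 32, 32)
        b, d, h, w = volume.shape
        feats = self.encoder(volume.reshape(b * d, 1, h, w)).reshape(b, d, -1)
        states, _ = self.lstm(feats)                # (B, D, hidden)
        slices = self.decoder(states)               # (B, D, 64*64)
        return slices.reshape(b, d, 64, 64)         # higher in-plane resolution

coarse = torch.rand(2, 32, 32, 32)                  # batch of coarse 32^3 volumes
print(SliceRefiner()(coarse).shape)                 # torch.Size([2, 32, 64, 64])
```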
{"title":"Shape Inpainting Using 3D Generative Adversarial Network and Recurrent Convolutional Networks","authors":"Weiyue Wang, Qiangui Huang, Suya You, Chao Yang, U. Neumann","doi":"10.1109/ICCV.2017.252","DOIUrl":"https://doi.org/10.1109/ICCV.2017.252","url":null,"abstract":"Recent advances in convolutional neural networks have shown promising results in 3D shape completion. But due to GPU memory limitations, these methods can only produce low-resolution outputs. To inpaint 3D models with semantic plausibility and contextual details, we introduce a hybrid framework that combines a 3D Encoder-Decoder Generative Adversarial Network (3D-ED-GAN) and a Longterm Recurrent Convolutional Network (LRCN). The 3DED- GAN is a 3D convolutional neural network trained with a generative adversarial paradigm to fill missing 3D data in low-resolution. LRCN adopts a recurrent neural network architecture to minimize GPU memory usage and incorporates an Encoder-Decoder pair into a Long Shortterm Memory Network. By handling the 3D model as a sequence of 2D slices, LRCN transforms a coarse 3D shape into a more complete and higher resolution volume. While 3D-ED-GAN captures global contextual structure of the 3D shape, LRCN localizes the fine-grained details. Experimental results on both real-world and synthetic data show reconstructions from corrupted models result in complete and high-resolution 3D objects.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"46 1","pages":"2317-2325"},"PeriodicalIF":0.0,"publicationDate":"2017-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80924800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 150
Synergy between Face Alignment and Tracking via Discriminative Global Consensus Optimization
Pub Date: 2017-10-26 | DOI: 10.1109/ICCV.2017.409
M. H. Khan, J. McDonagh, Georgios Tzimiropoulos
An open question in facial landmark localization in video is whether one should perform tracking or tracking-by-detection (i.e. face alignment). Tracking produces fittings of high accuracy but is prone to drifting. Tracking-by-detection is drift-free but results in low-accuracy fittings. To provide a solution to this problem, we describe the very first, to the best of our knowledge, synergistic approach between detection (face alignment) and tracking, which completely eliminates drifting from face tracking and does not merely perform tracking-by-detection. Our first main contribution is to show that one can achieve this synergy between detection and tracking using a principled optimization framework based on the theory of Global Variable Consensus Optimization using ADMM. Our second contribution is to show how the proposed analytic framework can be integrated within state-of-the-art discriminative methods for face alignment and tracking based on cascaded regression and deeply learned features. Overall, we call our method the Discriminative Global Consensus Model (DGCM). Our third contribution is to show that DGCM achieves a large performance improvement over the currently best-performing face tracking methods on the most challenging category of the 300-VW dataset.
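The global-variable consensus ADMM that the first contribution builds on can be shown on a toy problem. The sketch below solves a generic consensus least-squares problem with made-up data; it mirrors only the x-update / z-update / dual-update structure, not the paper's face-alignment objective.

```python
# Generic global-variable consensus ADMM on a toy problem:
# minimize sum_i ||x_i - a_i||^2  subject to  x_i = z for all i.
# (Illustrative only; the paper applies consensus ADMM to landmark-fitting terms.)
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 3))          # local targets a_i (hypothetical data)
rho = 1.0
x = np.zeros_like(a)                 # local variables x_i
z = np.zeros(3)                      # global consensus variable
u = np.zeros_like(a)                 # scaled dual variables

for _ in range(50):
    # x-update: each local subproblem has a closed-form solution.
    x = (2 * a + rho * (z - u)) / (2 + rho)
    # z-update: consensus variable is the average of (x_i + u_i).
    z = (x + u).mean(axis=0)
    # dual update penalises disagreement with the consensus.
    u += x - z

print(z, a.mean(axis=0))             # both approach the average of the a_i
```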
{"title":"Synergy between Face Alignment and Tracking via Discriminative Global Consensus Optimization","authors":"M. H. Khan, J. McDonagh, Georgios Tzimiropoulos","doi":"10.1109/ICCV.2017.409","DOIUrl":"https://doi.org/10.1109/ICCV.2017.409","url":null,"abstract":"An open question in facial landmark localization in video is whether one should perform tracking or tracking-by-detection (i.e. face alignment). Tracking produces fittings of high accuracy but is prone to drifting. Tracking-by-detection is drift-free but results in low accuracy fittings. To provide a solution to this problem, we describe the very first, to the best of our knowledge, synergistic approach between detection (face alignment) and tracking which completely eliminates drifting from face tracking, and does not merely perform tracking-by-detection. Our first main contribution is to show that one can achieve this synergy between detection and tracking using a principled optimization framework based on the theory of Global Variable Consensus Optimization using ADMM; Our second contribution is to show how the proposed analytic framework can be integrated within state-of-the-art discriminative methods for face alignment and tracking based on cascaded regression and deeply learned features. Overall, we call our method Discriminative Global Consensus Model (DGCM). Our third contribution is to show that DGCM achieves large performance improvement over the currently best performing face tracking methods on the most challenging category of the 300-VW dataset.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"13 1","pages":"3811-3819"},"PeriodicalIF":0.0,"publicationDate":"2017-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73879259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 36
Multi-view Dynamic Shape Refinement Using Local Temporal Integration
Pub Date: 2017-10-22 | DOI: 10.1109/ICCV.2017.336
Vincent Leroy, Jean-Sébastien Franco, Edmond Boyer
We consider 4D shape reconstructions in multi-view environments and investigate how to exploit temporal redundancy for precision refinement. In addition to being beneficial to many dynamic multi-view scenarios, this also enables larger scenes, where the increased precision can compensate for the reduced spatial resolution per image frame. With precision and scalability in mind, we propose a symmetric (non-causal) local time-window geometric integration scheme over temporal sequences, where shape reconstructions are refined framewise by warping local and reliable geometric regions of neighboring frames to them. This is in contrast to recent comparable approaches targeting a different context, with more compact scenes and real-time applications. These usually use a single dense volumetric update space or geometric template, which they causally track and update globally frame by frame, with limitations in scalability for larger scenes and, for template-based strategies, in topology and precision. Our templateless and local approach is a first step towards temporal shape super-resolution. We show that it improves reconstruction accuracy by considering multiple frames. To this purpose, and in addition to real data examples, we introduce a multi-camera synthetic dataset that provides ground-truth data for mid-scale dynamic scenes.
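A heavily simplified sketch of symmetric local-time-window integration follows. It assumes the neighbouring frames' geometry has already been warped into the current frame and reduced to per-pixel depth and reliability maps; both of those steps and the data below are hypothetical stand-ins rather than the paper's mesh-based refinement.

```python
# Simplified symmetric time-window fusion: refine frame t from already-warped
# depth maps and per-pixel reliabilities of frames t-k..t+k.
# (Illustrative only; the paper warps local geometric regions, not depth maps.)
import numpy as np

def fuse_window(warped_depths, reliabilities, eps=1e-6):
    """warped_depths, reliabilities: arrays of shape (2k+1, H, W)."""
    w = np.clip(reliabilities, 0.0, None)
    return (w * warped_depths).sum(axis=0) / (w.sum(axis=0) + eps)

k, H, W = 2, 4, 4
rng = np.random.default_rng(1)
depths = rng.uniform(1.0, 2.0, size=(2 * k + 1, H, W))   # hypothetical warped depths
conf = rng.uniform(0.0, 1.0, size=(2 * k + 1, H, W))     # hypothetical reliabilities
print(fuse_window(depths, conf).shape)                    # (4, 4)
```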
{"title":"Multi-view Dynamic Shape Refinement Using Local Temporal Integration","authors":"Vincent Leroy, Jean-Sébastien Franco, Edmond Boyer","doi":"10.1109/ICCV.2017.336","DOIUrl":"https://doi.org/10.1109/ICCV.2017.336","url":null,"abstract":"We consider 4D shape reconstructions in multi-view environments and investigate how to exploit temporal redundancy for precision refinement. In addition to being beneficial to many dynamic multi-view scenarios this also enables larger scenes where such increased precision can compensate for the reduced spatial resolution per image frame. With precision and scalability in mind, we propose a symmetric (non-causal) local time-window geometric integration scheme over temporal sequences, where shape reconstructions are refined framewise by warping local and reliable geometric regions of neighboring frames to them. This is in contrast to recent comparable approaches targeting a different context with more compact scenes and real-time applications. These usually use a single dense volumetric update space or geometric template, which they causally track and update globally frame by frame, with limitations in scalability for larger scenes and in topology and precision with a template based strategy. Our templateless and local approach is a first step towards temporal shape super-resolution. We show that it improves reconstruction accuracy by considering multiple frames. To this purpose, and in addition to real data examples, we introduce a multi-camera synthetic dataset that provides ground-truth data for mid-scale dynamic scenes.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"21 1","pages":"3113-3122"},"PeriodicalIF":0.0,"publicationDate":"2017-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79772107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 67
Detect to Track and Track to Detect
Pub Date: 2017-10-11 | DOI: 10.1109/ICCV.2017.330
Christoph Feichtenhofer, A. Pinz, Andrew Zisserman
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.
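The linking of frame-level detections into video-level results can be illustrated with a generic greedy IoU linker. This is a plain baseline for intuition only; the paper instead links detections through learned across-frame tracklets.

```python
# Greedy linking of per-frame detections into tracks using IoU between
# consecutive frames (generic baseline, not the paper's learned tracklets).
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tracks(frames, thr=0.5):
    """frames: list of lists of boxes. Returns tracks as lists of (frame, box)."""
    tracks = [[(0, b)] for b in frames[0]]
    for t in range(1, len(frames)):
        unused = list(frames[t])
        for tr in tracks:
            last_t, last_box = tr[-1]
            if last_t != t - 1 or not unused:
                continue
            best = max(unused, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= thr:
                tr.append((t, best))
                unused.remove(best)
        tracks.extend([[(t, b)] for b in unused])   # unmatched boxes start new tracks
    return tracks

frames = [[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(30, 30, 40, 40)]]
print(link_tracks(frames))   # one two-frame track plus one new single-frame track
```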
{"title":"Detect to Track and Track to Detect","authors":"Christoph Feichtenhofer, A. Pinz, Andrew Zisserman","doi":"10.1109/ICCV.2017.330","DOIUrl":"https://doi.org/10.1109/ICCV.2017.330","url":null,"abstract":"Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"1 1","pages":"3057-3065"},"PeriodicalIF":0.0,"publicationDate":"2017-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77303403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 485
Deeper, Broader and Artier Domain Generalization
Pub Date: 2017-10-09 | DOI: 10.1109/ICCV.2017.591
Da Li, Yongxin Yang, Yi-Zhe Song, Timothy M. Hospedales
The problem of domain generalization is to learn from multiple training domains and extract a domain-agnostic model that can then be applied to an unseen domain. Domain generalization (DG) has a clear motivation in contexts where there are target domains with distinct characteristics yet sparse data for training, for example recognition in sketch images, which are distinctly more abstract and rarer than photos. Nevertheless, DG methods have primarily been evaluated on photo-only benchmarks focusing on alleviating the dataset bias, where both problems of domain distinctiveness and data sparsity can be minimal. We argue that these benchmarks are overly straightforward, and show that simple deep learning baselines perform surprisingly well on them. In this paper, we make two main contributions: Firstly, we build upon the favorable domain-shift-robust properties of deep learning methods and develop a low-rank parameterized CNN model for end-to-end DG learning. Secondly, we develop a DG benchmark dataset covering photo, sketch, cartoon and painting domains. This is both more practically relevant and harder (bigger domain shift) than existing benchmarks. The results show that our method outperforms existing DG alternatives, and our dataset provides a more significant DG challenge to drive future research.
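One way to read the low-rank parameterisation is as a convolution whose per-domain kernel is a small linear combination of shared basis kernels. The sketch below encodes that schematic reading in PyTorch; the factorisation, rank, and layer sizes are illustrative assumptions, not the paper's exact model.

```python
# Sketch of a low-rank, domain-parameterised convolution: each domain's kernel
# is a coefficient-weighted sum of K shared basis kernels (schematic reading,
# not the paper's exact tensor factorisation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankDomainConv(nn.Module):
    def __init__(self, n_domains, in_ch, out_ch, k=3, rank=4):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(rank, out_ch, in_ch, k, k) * 0.01)
        self.coeff = nn.Parameter(torch.ones(n_domains, rank) / rank)

    def forward(self, x, domain_id):
        # Kernel for this domain = weighted sum of the shared basis kernels.
        weight = torch.einsum('r,roikl->oikl', self.coeff[domain_id], self.basis)
        return F.conv2d(x, weight, padding=1)

layer = LowRankDomainConv(n_domains=3, in_ch=3, out_ch=8)
x = torch.rand(2, 3, 32, 32)
print(layer(x, domain_id=1).shape)      # torch.Size([2, 8, 32, 32])
```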
{"title":"Deeper, Broader and Artier Domain Generalization","authors":"Da Li, Yongxin Yang, Yi-Zhe Song, Timothy M. Hospedales","doi":"10.1109/ICCV.2017.591","DOIUrl":"https://doi.org/10.1109/ICCV.2017.591","url":null,"abstract":"The problem of domain generalization is to learn from multiple training domains, and extract a domain-agnostic model that can then be applied to an unseen domain. Domain generalization (DG) has a clear motivation in contexts where there are target domains with distinct characteristics, yet sparse data for training. For example recognition in sketch images, which are distinctly more abstract and rarer than photos. Nevertheless, DG methods have primarily been evaluated on photo-only benchmarks focusing on alleviating the dataset bias where both problems of domain distinctiveness and data sparsity can be minimal. We argue that these benchmarks are overly straightforward, and show that simple deep learning baselines perform surprisingly well on them. In this paper, we make two main contributions: Firstly, we build upon the favorable domain shift-robust properties of deep learning methods, and develop a low-rank parameterized CNN model for end-to-end DG learning. Secondly, we develop a DG benchmark dataset covering photo, sketch, cartoon and painting domains. This is both more practically relevant, and harder (bigger domain shift) than existing benchmarks. The results show that our method outperforms existing DG alternatives, and our dataset provides a more significant DG challenge to drive future research.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"84 1","pages":"5543-5551"},"PeriodicalIF":0.0,"publicationDate":"2017-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90855023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1022
Depth Estimation Using Structured Light Flow — Analysis of Projected Pattern Flow on an Object’s Surface
Pub Date: 2017-10-02 | DOI: 10.1109/ICCV.2017.497
Furukawa Ryo, R. Sagawa, Hiroshi Kawasaki
Shape reconstruction techniques using structured light have been widely researched and developed due to their robustness, high precision, and density. Because these techniques are based on decoding a pattern to find correspondences, they implicitly require that the projected patterns be clearly captured by an image sensor, i.e., that defocus and motion blur of the projected pattern be avoided. Although intensive research has been conducted on defocus blur, little work addresses motion blur, and the only solution is to capture with an extremely fast shutter speed. In this paper, unlike the previous approaches, we actively utilize motion blur, which we refer to as a light flow, to estimate depth. Analysis reveals that a minimum of two light flows, which are retrieved from two patterns projected onto the object, is required for depth estimation. To retrieve two light flows at the same time, two sets of parallel line patterns are illuminated from two video projectors and the size of the motion blur of each line is precisely measured. By analyzing the light flows, i.e. the lengths of the blurs, scene depth information is estimated. In the experiments, 3D shapes of fast-moving objects, which are inevitably captured with motion blur, are successfully reconstructed by our technique.
{"title":"Depth Estimation Using Structured Light Flow — Analysis of Projected Pattern Flow on an Object’s Surface","authors":"Furukawa Ryo, R. Sagawa, Hiroshi Kawasaki","doi":"10.1109/ICCV.2017.497","DOIUrl":"https://doi.org/10.1109/ICCV.2017.497","url":null,"abstract":"Shape reconstruction techniques using structured light have been widely researched and developed due to their robustness, high precision, and density. Because the techniques are based on decoding a pattern to find correspondences, it implicitly requires that the projected patterns be clearly captured by an image sensor, i.e., to avoid defocus and motion blur of the projected pattern. Although intensive researches have been conducted for solving defocus blur, few researches for motion blur and only solution is to capture with extremely fast shutter speed. In this paper, unlike the previous approaches, we actively utilize motion blur, which we refer to as a light flow, to estimate depth. Analysis reveals that minimum two light flows, which are retrieved from two projected patterns on the object, are required for depth estimation. To retrieve two light flows at the same time, two sets of parallel line patterns are illuminated from two video projectors and the size of motion blur of each line is precisely measured. By analyzing the light flows, i.e. lengths of the blurs, scene depth information is estimated. In the experiments, 3D shapes of fast moving objects, which are inevitably captured with motion blur, are successfully reconstructed by our technique.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"16 1","pages":"4650-4658"},"PeriodicalIF":0.0,"publicationDate":"2017-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88513060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Temporal Shape Super-Resolution by Intra-frame Motion Encoding Using High-fps Structured Light
Pub Date: 2017-10-02 | DOI: 10.1109/ICCV.2017.22
Yuki Shiba, S. Ono, Furukawa Ryo, S. Hiura, Hiroshi Kawasaki
One solution for depth imaging of a moving scene is to project a static pattern onto the object and use just a single image for reconstruction. However, if the motion of the object is too fast with respect to the exposure time of the image sensor, the patterns in the captured image are blurred and reconstruction fails. In this paper, we impose multiple projection patterns onto each single captured image to realize temporal super-resolution of depth image sequences. With our method, multiple patterns are projected onto the object at a higher fps than is possible with a camera. In this case, the observed pattern varies depending on the depth and motion of the object, so we can extract temporal information of the scene from each single image. The decoding process is realized using a learning-based approach where no geometric calibration is needed. Experiments confirm the effectiveness of our method, where sequential shapes are reconstructed from a single image. Both quantitative evaluations and comparisons with recent techniques were also conducted.
{"title":"Temporal Shape Super-Resolution by Intra-frame Motion Encoding Using High-fps Structured Light","authors":"Yuki Shiba, S. Ono, Furukawa Ryo, S. Hiura, Hiroshi Kawasaki","doi":"10.1109/ICCV.2017.22","DOIUrl":"https://doi.org/10.1109/ICCV.2017.22","url":null,"abstract":"One of the solutions of depth imaging of moving scene is to project a static pattern on the object and use just a single image for reconstruction. However, if the motion of the object is too fast with respect to the exposure time of the image sensor, patterns on the captured image are blurred and reconstruction fails. In this paper, we impose multiple projection patterns into each single captured image to realize temporal super resolution of the depth image sequences. With our method, multiple patterns are projected onto the object with higher fps than possible with a camera. In this case, the observed pattern varies depending on the depth and motion of the object, so we can extract temporal information of the scene from each single image. The decoding process is realized using a learning-based approach where no geometric calibration is needed. Experiments confirm the effectiveness of our method where sequential shapes are reconstructed from a single image. Both quantitative evaluations and comparisons with recent techniques were also conducted.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"11 1","pages":"115-123"},"PeriodicalIF":0.0,"publicationDate":"2017-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84328705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Misalignment-Robust Joint Filter for Cross-Modal Image Pairs
Pub Date: 2017-10-01 | DOI: 10.1109/ICCV.2017.357
Takashi Shibata, Masayuki Tanaka, M. Okutomi
Although several powerful joint filters for cross-modal image pairs have been proposed, the existing joint filters generate severe artifacts when there are misalignments between the target and guidance images. Our goal is to generate an artifact-free output image even from misaligned target and guidance images. We propose a novel misalignment-robust joint filter based on weight-volume-based image composition and a joint-filter cost volume. Our proposed method first generates a set of translated guidances. Next, the joint-filter cost volume and a set of filtered images are computed from the target image and the set of translated guidances. Then, a weight volume is obtained from the joint-filter cost volume while considering spatial smoothness and label sparseness. The final output image is composed by fusing the set of filtered images with the weight volume. The key is to generate the final output image directly from the set of filtered images by weighted averaging, using the weight volume obtained from the joint-filter cost volume. The proposed framework is widely applicable and can involve any kind of joint filter. Experimental results show that the proposed method is effective for various applications including image denoising, image up-sampling, haze removal and depth map interpolation.
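The weight-volume composition step can be sketched as a per-pixel softmax over candidate costs followed by a weighted average of the corresponding filtered images. The snippet below uses random stand-in costs and filtered images and omits the spatial-smoothness and label-sparseness terms, so it only illustrates the fusion arithmetic, not the paper's full method.

```python
# Sketch of composing an output from a stack of filtered images using a cost
# volume: per-pixel softmax weights over candidates, then a weighted average.
# (Random stand-in data; smoothness and sparseness terms are omitted.)
import numpy as np

def compose(filtered_stack, cost_volume, beta=10.0):
    """filtered_stack, cost_volume: (K, H, W). Lower cost -> higher weight."""
    w = np.exp(-beta * cost_volume)
    w /= w.sum(axis=0, keepdims=True)            # per-pixel softmax over candidates
    return (w * filtered_stack).sum(axis=0)

K, H, W = 5, 8, 8
rng = np.random.default_rng(2)
filtered = rng.uniform(size=(K, H, W))            # hypothetical filtered images
cost = rng.uniform(size=(K, H, W))                # hypothetical joint-filter costs
print(compose(filtered, cost).shape)              # (8, 8)
```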
{"title":"Misalignment-Robust Joint Filter for Cross-Modal Image Pairs","authors":"Takashi Shibata, Masayuki Tanaka, M. Okutomi","doi":"10.1109/ICCV.2017.357","DOIUrl":"https://doi.org/10.1109/ICCV.2017.357","url":null,"abstract":"Although several powerful joint filters for cross-modal image pairs have been proposed, the existing joint filters generate severe artifacts when there are misalignments between a target and a guidance images. Our goal is to generate an artifact-free output image even from the misaligned target and guidance images. We propose a novel misalignment-robust joint filter based on weight-volume-based image composition and joint-filter cost volume. Our proposed method first generates a set of translated guidances. Next, the joint-filter cost volume and a set of filtered images are computed from the target image and the set of the translated guidances. Then, a weight volume is obtained from the joint-filter cost volume while considering a spatial smoothness and a label-sparseness. The final output image is composed by fusing the set of the filtered images with the weight volume for the filtered images. The key is to generate the final output image directly from the set of the filtered images by weighted averaging using the weight volume that is obtained from the joint-filter cost volume. The proposed framework is widely applicable and can involve any kind of joint filter. Experimental results show that the proposed method is effective for various applications including image denosing, image up-sampling, haze removal and depth map interpolation.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"51 1","pages":"3315-3324"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73559190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding
Pub Date: 2017-10-01 | DOI: 10.1109/ICCV.2017.208
Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, G. Hua
We address the problem of dense visual-semantic embedding that maps not only full sentences and whole images but also phrases within sentences and salient regions within images into a multimodal embedding space. Such dense embeddings, when applied to the task of image captioning, enable us to produce several region-oriented and detailed phrases rather than just an overview sentence to describe an image. Specifically, we present a hierarchical structured recurrent neural network (RNN), namely Hierarchical Multimodal LSTM (HM-LSTM). Compared with chain structured RNN, our proposed model exploits the hierarchical relations between sentences and phrases, and between whole images and image regions, to jointly establish their representations. Without the need of any supervised labels, our proposed model automatically learns the fine-grained correspondences between phrases and image regions towards the dense embedding. Extensive experiments on several datasets validate the efficacy of our method, which compares favorably with the state-of-the-art methods.
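A common way to learn such a joint embedding space is a bidirectional hinge ranking loss between matched phrase and region embeddings. The sketch below shows that generic loss with random embeddings; it is not the HM-LSTM itself, whose hierarchical encoders would produce these inputs.

```python
# Sketch of a bidirectional hinge ranking loss aligning phrase and region
# embeddings in a shared space (generic loss, not the HM-LSTM architecture).
import torch
import torch.nn.functional as F

def ranking_loss(phrase_emb, region_emb, margin=0.2):
    """phrase_emb, region_emb: (N, D); row i of each forms a matching pair."""
    p = F.normalize(phrase_emb, dim=1)
    r = F.normalize(region_emb, dim=1)
    scores = p @ r.t()                                  # cosine similarities
    pos = scores.diag().unsqueeze(1)                    # scores of matching pairs
    cost_p = (margin + scores - pos).clamp(min=0)       # rank regions per phrase
    cost_r = (margin + scores - pos.t()).clamp(min=0)   # rank phrases per region
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_p.masked_fill(mask, 0).mean() + cost_r.masked_fill(mask, 0).mean()

phrases = torch.randn(4, 128)    # stand-in phrase embeddings
regions = torch.randn(4, 128)    # stand-in region embeddings
print(ranking_loss(phrases, regions))
```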
{"title":"Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding","authors":"Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, G. Hua","doi":"10.1109/ICCV.2017.208","DOIUrl":"https://doi.org/10.1109/ICCV.2017.208","url":null,"abstract":"We address the problem of dense visual-semantic embedding that maps not only full sentences and whole images but also phrases within sentences and salient regions within images into a multimodal embedding space. Such dense embeddings, when applied to the task of image captioning, enable us to produce several region-oriented and detailed phrases rather than just an overview sentence to describe an image. Specifically, we present a hierarchical structured recurrent neural network (RNN), namely Hierarchical Multimodal LSTM (HM-LSTM). Compared with chain structured RNN, our proposed model exploits the hierarchical relations between sentences and phrases, and between whole images and image regions, to jointly establish their representations. Without the need of any supervised labels, our proposed model automatically learns the fine-grained correspondences between phrases and image regions towards the dense embedding. Extensive experiments on several datasets validate the efficacy of our method, which compares favorably with the state-of-the-art methods.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"15 1","pages":"1899-1907"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75295057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 139
Deep Facial Action Unit Recognition from Partially Labeled Data
Pub Date: 2017-10-01 | DOI: 10.1109/ICCV.2017.426
Shan Wu, Shangfei Wang, Bowen Pan, Q. Ji
Current work on facial action unit (AU) recognition requires AU-labeled facial images. Although large amounts of facial images are readily available, AU annotation is expensive and time consuming. To address this, we propose a deep facial action unit recognition approach that learns from partially AU-labeled data. The proposed approach makes full use of both the partly available ground-truth AU labels and the readily available large-scale facial images without annotation. Specifically, we propose to learn a label distribution from the ground-truth AU labels, and then train the AU classifiers from the large-scale facial images by simultaneously maximizing the log likelihood of the AU mapping functions with respect to the learnt label distribution over all training data and minimizing the error between predicted AUs and ground-truth AUs on the labeled data. A restricted Boltzmann machine is adopted to model the AU label distribution, a deep neural network is used to learn facial representations from facial images, and a support vector machine is employed as the classifier. Experiments on two benchmark databases demonstrate the effectiveness of the proposed approach.
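The ingredients named here (an RBM over AU label vectors, learned features, and an SVM) can be assembled from off-the-shelf components as a rough sketch. The example below uses scikit-learn's BernoulliRBM and LinearSVC on random stand-in data; it mirrors the parts, not the paper's joint training objective.

```python
# Sketch of the ingredients: a BernoulliRBM fit on (partially available) binary
# AU label vectors to model their joint distribution, and an SVM trained on
# features. Random data stands in for real AU labels and deep facial features.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
au_labels = rng.integers(0, 2, size=(200, 12))     # labelled subset: 12 binary AUs
rbm = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(au_labels)                                  # models the AU label distribution

features = rng.normal(size=(200, 64))               # stand-in deep facial features
svm = LinearSVC().fit(features, au_labels[:, 0])    # one binary classifier per AU
print(svm.score(features, au_labels[:, 0]))
```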
{"title":"Deep Facial Action Unit Recognition from Partially Labeled Data","authors":"Shan Wu, Shangfei Wang, Bowen Pan, Q. Ji","doi":"10.1109/ICCV.2017.426","DOIUrl":"https://doi.org/10.1109/ICCV.2017.426","url":null,"abstract":"Current work on facial action unit (AU) recognition requires AU-labeled facial images. Although large amounts of facial images are readily available, AU annotation is expensive and time consuming. To address this, we propose a deep facial action unit recognition approach learning from partially AU-labeled data. The proposed approach makes full use of both partly available ground-truth AU labels and the readily available large scale facial images without annotation. Specifically, we propose to learn label distribution from the ground-truth AU labels, and then train the AU classifiers from the large-scale facial images by maximizing the log likelihood of the mapping functions of AUs with regard to the learnt label distribution for all training data and minimizing the error between predicted AUs and ground-truth AUs for labeled data simultaneously. A restricted Boltzmann machine is adopted to model AU label distribution, a deep neural network is used to learn facial representation from facial images, and the support vector machine is employed as the classifier. Experiments on two benchmark databases demonstrate the effectiveness of the proposed approach.","PeriodicalId":6559,"journal":{"name":"2017 IEEE International Conference on Computer Vision (ICCV)","volume":"19 1","pages":"3971-3979"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75348003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23