
2019 IEEE/CVF International Conference on Computer Vision (ICCV): Latest Publications

2019 Organizing Committee
Pub Date : 2019-10-01 DOI: 10.1109/iccv.2019.00006
{"title":"2019 Organizing Committee","authors":"","doi":"10.1109/iccv.2019.00006","DOIUrl":"https://doi.org/10.1109/iccv.2019.00006","url":null,"abstract":"","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86570211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semi-Supervised Monocular 3D Face Reconstruction With End-to-End Shape-Preserved Domain Transfer
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00949
Jingtan Piao, C. Qian, Hongsheng Li
Monocular face reconstruction is a challenging task in computer vision that aims to recover 3D face geometry from a single RGB face image. Recently, deep learning-based methods have achieved great improvements on monocular face reconstruction. However, for such methods to reach optimal performance, it is paramount to have large-scale training images with ground-truth 3D face geometry, which is generally difficult for humans to annotate. To tackle this problem, we propose a semi-supervised monocular reconstruction method that jointly optimizes a shape-preserved domain-transfer CycleGAN and a shape estimation network. The framework is trained in a semi-supervised manner on 3D rendered images with ground-truth shapes and on in-the-wild face images without any extra annotation. The CycleGAN network transforms all realistic images to have the rendered style and is trained end-to-end within the overall framework. This is the key difference from existing CycleGAN-based learning methods, which use CycleGAN only as a separate training-sample generator. A novel landmark consistency loss and an edge-aware shape estimation loss are proposed so that our two networks jointly solve the challenging face reconstruction problem. Extensive experiments on public face reconstruction datasets demonstrate the effectiveness of our overall method as well as its individual components.
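As a rough, hedged illustration of the landmark consistency idea mentioned above, the following sketch compares 2D landmarks predicted on a real face image with those predicted on its domain-transferred counterpart; the tensor shapes, the smooth-L1 distance, and the 68-landmark layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def landmark_consistency_loss(landmarks_real, landmarks_transferred):
    """Penalize disagreement between 2D landmarks predicted on a real face
    image and on its CycleGAN-transferred (rendered-style) counterpart.

    Both tensors are assumed to have shape (batch, num_landmarks, 2) with
    (x, y) coordinates normalized to [0, 1].
    """
    return F.smooth_l1_loss(landmarks_transferred, landmarks_real)

# Hypothetical usage: both predictions come from the same shape-estimation
# network, applied to the original image and to its domain-transferred version.
lm_real = torch.rand(4, 68, 2)                        # e.g. 68 landmarks per face
lm_transferred = lm_real + 0.01 * torch.randn(4, 68, 2)
loss = landmark_consistency_loss(lm_real, lm_transferred)
```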
Citations: 22
Spatiotemporal Feature Residual Propagation for Action Prediction
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00710
He Zhao, Richard P. Wildes
Recognizing actions from limited preliminary video observations has seen considerable recent progress. Typically, however, such progress has been achieved without explicitly modeling fine-grained motion evolution as a potentially valuable information source. In this study, we address this task by investigating how action patterns evolve over time in a spatial feature space. There are three key components to our system. First, we work with intermediate-layer ConvNet features, which allow for abstraction from raw data while retaining the spatial layout that is sacrificed in approaches relying on vectorized global representations. Second, instead of propagating features per se, we propagate their residuals across time, which allows for a compact representation that reduces redundancy while retaining essential information about evolution over time. Third, we employ a Kalman filter to combat error build-up and unify across prediction start times. Extensive experimental results on the JHMDB21, UCF101 and BIT datasets show that our approach leads to a new state-of-the-art in action prediction.
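The following is a minimal sketch, under assumed noise parameters and a random-walk residual model, of the general idea of propagating feature residuals forward in time with Kalman smoothing; it is an illustration of the concept, not the authors' network.

```python
import numpy as np

def predict_future_features(features, horizon, Q=1e-3, R=1e-1):
    """Sketch of residual propagation for action prediction: estimate the
    frame-to-frame feature residual with a per-element Kalman filter over the
    observed frames, then extrapolate future feature maps with it.

    features: (T, C, H, W) observed intermediate-layer feature maps.
    Q, R: assumed process / observation noise variances (illustrative values).
    """
    residuals = np.diff(features, axis=0)        # (T-1, C, H, W) observed residuals
    est = residuals[0].copy()                    # filtered residual estimate
    P = np.ones_like(est)                        # estimate variance
    for obs in residuals[1:]:
        P = P + Q                                # predict step (random-walk model)
        K = P / (P + R)                          # Kalman gain
        est = est + K * (obs - est)              # update with the newest residual
        P = (1.0 - K) * P
    last = features[-1]
    return np.stack([last + (k + 1) * est for k in range(horizon)])

future = predict_future_features(np.random.randn(8, 64, 7, 7), horizon=3)
```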
Citations: 31
Understanding Deep Networks via Extremal Perturbations and Smooth Masks
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00304
Ruth Fong, Mandela Patrick, A. Vedaldi
Attribution is the problem of finding which parts of an image are the most responsible for the output of a deep neural network. An important family of attribution methods is based on measuring the effect of perturbations applied to the input image, either via exhaustive search or by finding representative perturbations via optimization. In this paper, we discuss some of the shortcomings of existing approaches to perturbation analysis and address them by introducing the concept of extremal perturbations, which are theoretically grounded and interpretable. We also introduce a number of technical innovations to compute these extremal perturbations, including a new area constraint and a parametric family of smooth perturbations, which allow us to remove all tunable weighting factors from the optimization problem. We analyze the effect of perturbations as a function of their area, demonstrating excellent sensitivity to the spatial properties of the network under stimulation. We also extend perturbation analysis to the intermediate layers of a deep neural network. This application allows us to show how compactly an image can be represented (in terms of the number of channels it requires). We also demonstrate that the consistency with which images of a given class rely on the same intermediate channel correlates well with class accuracy.
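A simplified sketch of a preservation-style extremal-perturbation search is shown below. The low-resolution mask with bilinear upsampling stands in for the paper's smooth parametric masks, and the sorting-based penalty only mimics the spirit of the area constraint; the learning rate, penalty weight `lam`, and step count are assumptions.

```python
import torch
import torch.nn.functional as F

def extremal_perturbation(model, image, target_class, area=0.1, steps=200, lam=10.0):
    """Optimize a smooth mask so that keeping roughly an `area` fraction of the
    image preserves the target class score. `image` has shape (1, 3, H, W) and
    `model` returns class logits of shape (1, num_classes)."""
    low_res = (image.shape[-2] // 16, image.shape[-1] // 16)
    m = torch.zeros(1, 1, *low_res, requires_grad=True)
    optimizer = torch.optim.Adam([m], lr=0.05)
    for _ in range(steps):
        mask = torch.sigmoid(m)
        mask_up = F.interpolate(mask, size=image.shape[-2:], mode="bilinear",
                                align_corners=False)
        score = model(image * mask_up)[0, target_class]
        # Area penalty: sorted mask values should match a step function that is
        # 1 on the top `area` fraction of pixels and 0 elsewhere.
        sorted_vals, _ = mask_up.flatten().sort(descending=True)
        n = sorted_vals.numel()
        reference = (torch.arange(n, dtype=sorted_vals.dtype) < area * n).float()
        loss = -score + lam * ((sorted_vals - reference) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return torch.sigmoid(m).detach()
```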
Citations: 314
Learning Robust Facial Landmark Detection via Hierarchical Structured Ensemble
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00023
Xu Zou, Sheng Zhong, Luxin Yan, Xiangyu Zhao, Jiahuan Zhou, Ying Wu
Heatmap regression-based models have significantly advanced the progress of facial landmark detection. However, the lack of structural constraints always generates inaccurate heatmaps resulting in poor landmark detection performance. While hierarchical structure modeling methods have been proposed to tackle this issue, they all heavily rely on manually designed tree structures. The designed hierarchical structure is likely to be completely corrupted due to the missing or inaccurate prediction of landmarks. To the best of our knowledge, in the context of deep learning, no work before has investigated how to automatically model proper structures for facial landmarks, by discovering their inherent relations. In this paper, we propose a novel Hierarchical Structured Landmark Ensemble (HSLE) model for learning robust facial landmark detection, by using it as the structural constraints. Different from existing approaches of manually designing structures, our proposed HSLE model is constructed automatically via discovering the most robust patterns so HSLE has the ability to robustly depict both local and holistic landmark structures simultaneously. Our proposed HSLE can be readily plugged into any existing facial landmark detection baselines for further performance improvement. Extensive experimental results demonstrate our approach significantly outperforms the baseline by a large margin to achieve a state-of-the-art performance.
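As a loose illustration of discovering landmark structure from data rather than designing it by hand, the toy sketch below connects landmark pairs whose relative offsets are most stable across annotated faces via a minimum spanning tree; it is not the HSLE model itself.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def discover_landmark_structure(landmarks):
    """Toy structure discovery: link landmark pairs whose relative offsets vary
    least across training faces.

    landmarks: (N, K, 2) array of K annotated landmarks on N faces.
    Returns a list of (i, j) landmark index pairs forming a spanning tree.
    """
    n, k, _ = landmarks.shape
    cost = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            offsets = landmarks[:, i, :] - landmarks[:, j, :]
            cost[i, j] = offsets.var(axis=0).sum()   # unstable pairs cost more
    tree = minimum_spanning_tree(cost)               # keeps the most stable edges
    return np.transpose(np.nonzero(tree.toarray()))

edges = discover_landmark_structure(np.random.rand(100, 68, 2))
```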
Citations: 46
Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00085
Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, S. Srinivasa, Yaser Sheikh
We present a 16.2-million frame (50-hour) multimodal dataset of two-person face-to-face spontaneous conversations. Our dataset features synchronized body and finger motion as well as audio data. To the best of our knowledge, it represents the largest motion capture and audio dataset of natural conversations to date. The statistical analysis verifies strong intraperson and interperson covariance of arm, hand, and speech features, potentially enabling new directions on data-driven social behavior analysis, prediction, and synthesis. As an illustration, we propose a novel real-time finger motion synthesis method: a temporal neural network innovatively trained with an inverse kinematics (IK) loss, which adds skeletal structural information to the generative model. Our qualitative user study shows that the finger motion generated by our method is perceived as natural and conversation enhancing, while the quantitative ablation study demonstrates the effectiveness of IK loss.
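A small, hypothetical sketch of an inverse-kinematics-style loss is shown below: predicted joint angles are penalized through the joint positions they imply under forward kinematics, so the skeletal structure enters the objective. It uses a planar finger chain and made-up bone lengths, not the paper's 3D body model.

```python
import torch

def forward_kinematics(angles, bone_lengths):
    """Compute 2D joint positions of a planar finger chain from joint angles.
    angles: (B, J) relative joint angles; bone_lengths: (J,) segment lengths."""
    B, J = angles.shape
    positions = []
    cum_angle = torch.zeros(B)
    tip = torch.zeros(B, 2)
    for j in range(J):
        cum_angle = cum_angle + angles[:, j]
        step = torch.stack([torch.cos(cum_angle), torch.sin(cum_angle)], dim=-1)
        tip = tip + bone_lengths[j] * step
        positions.append(tip)
    return torch.stack(positions, dim=1)             # (B, J, 2)

def ik_loss(pred_angles, gt_positions, bone_lengths):
    """Penalize predicted angles by the positional error of the joints they imply."""
    pred_positions = forward_kinematics(pred_angles, bone_lengths)
    return ((pred_positions - gt_positions) ** 2).mean()

bones = torch.tensor([1.0, 0.8, 0.6])                # assumed bone lengths
angles = torch.zeros(4, 3, requires_grad=True)
loss = ik_loss(angles, forward_kinematics(torch.rand(4, 3), bones), bones)
loss.backward()
```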
Citations: 65
Flare in Interference-Based Hyperspectral Cameras
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.01027
Eden Sassoon, T. Treibitz, Y. Schechner
Stray light (flare) is formed inside cameras by internal reflections between optical elements. We point out a flare effect of significant magnitude and implication to snapshot hyperspectral imagers. Recent technologies enable placing interference-based filters on individual pixels in imaging sensors. These filters have narrow transmission bands around custom wavelengths and high transmission efficiency. Cameras using arrays of such filters are compact, robust and fast. However, as opposed to traditional broad-band filters, which often absorb unwanted light, narrow band-pass interference filters reflect non-transmitted light. This is a source of very significant flare which biases hyperspectral measurements. The bias in any pixel depends on spectral content in other pixels. We present a theoretical image formation model for this effect, and quantify it through simulations and experiments. In addition, we test deflaring of signals affected by such flare.
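A toy rendering of the flare mechanism described above might look as follows: each pixel records its own filtered band plus a spatially spread share of the light reflected by narrow band-pass filters elsewhere in the frame. The uniform point-spread function and the reflection fraction are assumptions, not a calibrated camera model.

```python
import numpy as np

def flare_biased_measurement(scene, transmission, flare_psf):
    """Toy interference-filter flare model.

    scene:        (H, W) per-pixel radiance within each pixel's filter band.
    transmission: scalar in (0, 1], filter transmission efficiency.
    flare_psf:    (H, W) kernel describing how reflected light spreads across
                  the sensor (an assumed stand-in for internal reflections).
    """
    direct = transmission * scene
    reflected = (1.0 - transmission) * scene          # light bounced back by filters
    # Spread the reflected light over the sensor via (circular) FFT convolution.
    flare = np.real(np.fft.ifft2(np.fft.fft2(reflected) * np.fft.fft2(flare_psf)))
    return direct + flare

H = W = 64
psf = np.ones((H, W)) / (H * W)                       # uniform veiling glare (assumed)
measurement = flare_biased_measurement(np.random.rand(H, W), 0.9, psf)
```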
Citations: 3
Leveraging Long-Range Temporal Relationships Between Proposals for Video Object Detection
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00985
Mykhailo Shvets, Wei Liu, A. Berg
Single-frame object detectors perform well on videos sometimes, even without temporal context. However, challenges such as occlusion, motion blur, and rare poses of objects are hard to resolve without temporal awareness. Thus, there is a strong need to improve video object detection by considering long-range temporal dependencies. In this paper, we present a light-weight modification to a single-frame detector that accounts for arbitrary long dependencies in a video. It improves the accuracy of a single-frame detector significantly with negligible compute overhead. The key component of our approach is a novel temporal relation module, operating on object proposals, that learns the similarities between proposals from different frames and selects proposals from past and/or future to support current proposals. Our final “causal" model, without any offline post-processing steps, runs at a similar speed as a single-frame detector and achieves state-of-the-art video object detection on ImageNet VID dataset.
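The sketch below illustrates the general idea of a temporal relation step over proposal features: each current-frame proposal attends over proposals gathered from other frames and is augmented with the similarity-weighted sum. The shapes, the single dot-product similarity, and the residual fusion are assumptions rather than the paper's exact module.

```python
import torch

def temporal_relation(current, support):
    """Augment current-frame proposal features with temporal context.

    current: (N, D) proposal features from the current frame.
    support: (M, D) proposal features gathered from past/future frames.
    """
    D = current.shape[-1]
    attn = torch.softmax(current @ support.t() / D ** 0.5, dim=-1)  # (N, M) similarities
    aggregated = attn @ support                                     # (N, D) weighted sum
    return current + aggregated            # residual fusion of temporal context

enhanced = temporal_relation(torch.randn(300, 1024), torch.randn(900, 1024))
```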
Citations: 70
Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00080
Xingxuan Zhang, Feng Cheng, Shilin Wang
Current state-of-the-art approaches for lip reading are based on sequence-to-sequence architectures that are designed for natural machine translation and audio speech recognition. Hence, these methods do not fully exploit the characteristics of lip dynamics, causing two main drawbacks. First, the short-range temporal dependencies, which are critical to the mapping from lip images to visemes, receive no extra attention. Second, local spatial information is discarded in existing sequence models due to the use of global average pooling (GAP). To address these drawbacks, we propose a Temporal Focal block to sufficiently describe short-range dependencies and a Spatio-Temporal Fusion Module (STFM) to maintain the local spatial information and to reduce the feature dimensions as well. The experimental results demonstrate that our method achieves performance comparable to the state-of-the-art approach while using much less training data and a much lighter Convolutional Feature Extractor. The training time is reduced by 12 days due to the convolutional structure and the local self-attention mechanism.
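To illustrate the contrast with global average pooling, the sketch below fuses spatio-temporal features with convolutions before any spatial collapse, so local layout can influence the sequence model; the channel sizes, kernel sizes, and 4x4 spatial resolution are assumptions, not the paper's STFM.

```python
import torch
import torch.nn as nn

class SpatialPreservingFusion(nn.Module):
    """Fuse per-frame feature maps across space and time with convolutions
    instead of collapsing each frame to a single vector with GAP."""

    def __init__(self, in_channels=512, out_dim=256):
        super().__init__()
        self.reduce = nn.Conv3d(in_channels, out_dim, kernel_size=(3, 3, 3),
                                padding=(1, 1, 1))     # joint space-time fusion
        self.squeeze = nn.Conv3d(out_dim, out_dim, kernel_size=(1, 4, 4))

    def forward(self, x):                              # x: (B, C, T, H, W)
        fused = torch.relu(self.reduce(x))
        return self.squeeze(fused).flatten(3).mean(-1) # (B, out_dim, T)

frames = torch.randn(2, 512, 16, 4, 4)                 # features for 16 lip frames
sequence_features = SpatialPreservingFusion()(frames)
```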
Citations: 56
PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification Using Highly Randomized Synthetic Data
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00030
Zheng Tang, M. Naphade, Stan Birchfield, Jonathan Tremblay, William Hodge, Ratnesh Kumar, Shuo Wang, Xiaodong Yang
In comparison with person re-identification (ReID), which has been widely studied in the research community, vehicle ReID has received less attention. Vehicle ReID is challenging due to 1) high intra-class variability (caused by the dependency of shape and appearance on viewpoint), and 2) small inter-class variability (caused by the similarity in shape and appearance between vehicles produced by different manufacturers). To address these challenges, we propose a Pose-Aware Multi-Task Re-Identification (PAMTRI) framework. This approach includes two innovations compared with previous methods. First, it overcomes viewpoint-dependency by explicitly reasoning about vehicle pose and shape via keypoints, heatmaps and segments from pose estimation. Second, it jointly classifies semantic vehicle attributes (colors and types) while performing ReID, through multi-task learning with the embedded pose representations. Since manually labeling images with detailed pose and attribute information is prohibitive, we create a large-scale highly randomized synthetic dataset with automatically annotated vehicle attributes for training. Extensive experiments validate the effectiveness of each proposed component, showing that PAMTRI achieves significant improvement over state-of-the-art on two mainstream vehicle ReID benchmarks: VeRi and CityFlow-ReID.
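A minimal sketch of the multi-task idea, assuming hypothetical feature dimensions and class counts, is shown below: a shared (pose-augmented) feature feeds an identity-embedding head plus color and type classifiers that are trained jointly with cross-entropy losses.

```python
import torch
import torch.nn as nn

class MultiTaskReIDHead(nn.Module):
    """Joint heads for vehicle ReID embedding plus color and type attributes."""

    def __init__(self, feat_dim=2048, embed_dim=256, n_ids=333,
                 n_colors=10, n_types=9):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)    # ReID embedding
        self.id_head = nn.Linear(embed_dim, n_ids)     # identity classification
        self.color_head = nn.Linear(feat_dim, n_colors)
        self.type_head = nn.Linear(feat_dim, n_types)

    def forward(self, feats):
        emb = self.embed(feats)
        return emb, self.id_head(emb), self.color_head(feats), self.type_head(feats)

head = MultiTaskReIDHead()
emb, id_logits, color_logits, type_logits = head(torch.randn(8, 2048))
loss = (nn.functional.cross_entropy(id_logits, torch.randint(0, 333, (8,)))
        + nn.functional.cross_entropy(color_logits, torch.randint(0, 10, (8,)))
        + nn.functional.cross_entropy(type_logits, torch.randint(0, 9, (8,))))
```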
Citations: 128