
Latest publications from the 2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self-Critical Attention Learning for Person Re-Identification
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00973
Guangyi Chen, Chunze Lin, Liangliang Ren, Jiwen Lu, Jie Zhou
In this paper, we propose a self-critical attention learning method for person re-identification. Unlike most existing methods which train the attention mechanism in a weakly-supervised manner and ignore the attention confidence level, we learn the attention with a critic which measures the attention quality and provides a powerful supervisory signal to guide the learning process. Moreover, the critic model facilitates the interpretation of the effectiveness of the attention mechanism during the learning process, by estimating the quality of the attention maps. Specifically, we jointly train our attention agent and critic in a reinforcement learning manner, where the agent produces the visual attention while the critic analyzes the gain from the attention and guides the agent to maximize this gain. We design spatial- and channel-wise attention models with our critic module and evaluate them on three popular benchmarks including Market-1501, DukeMTMC-ReID, and CUHK03. The experimental results demonstrate the superiority of our method, which outperforms the state-of-the-art methods by a large margin of 5.9%/2.1%, 6.3%/3.0%, and 10.5%/9.5% on mAP/Rank-1, respectively.
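The joint actor-critic training described above can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the authors' implementation: the module sizes, tensor shapes, and the definition of the reward as the drop in ID loss when the attention map is applied are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): an attention "agent" that masks CNN
# features, and a critic that predicts the gain the mask brings to the ID loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAgent(nn.Module):          # produces a spatial attention map
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)
    def forward(self, feat):              # feat: (B, C, H, W)
        return torch.sigmoid(self.conv(feat))        # (B, 1, H, W) in [0, 1]

class Critic(nn.Module):                  # scores the quality of an attention map
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                 nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
    def forward(self, attn):              # attn: (B, 1, H, W)
        return self.net(attn).squeeze(1)  # predicted gain, (B,)

backbone = nn.Conv2d(3, 64, 3, padding=1)             # stand-in for a real CNN
classifier = nn.Linear(64, 751)                        # e.g. Market-1501 identities
agent, critic = AttentionAgent(64), Critic()
params = list(backbone.parameters()) + list(classifier.parameters()) + list(agent.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-4)

def id_loss(feat, labels):                # global-pooled feature -> per-sample ID loss
    logits = classifier(feat.mean(dim=(2, 3)))
    return F.cross_entropy(logits, labels, reduction='none')

images = torch.randn(4, 3, 64, 32)        # toy batch
labels = torch.randint(0, 751, (4,))

feat = backbone(images)
attn = agent(feat)
loss_plain = id_loss(feat, labels)                     # loss without attention
loss_attn = id_loss(feat * attn, labels)               # loss with attention applied
gain = (loss_plain - loss_attn).detach()               # "reward": how much attention helped

critic_loss = F.mse_loss(critic(attn.detach()), gain)  # critic learns to predict the gain
opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

# agent/backbone step: minimize the attended ID loss and maximize the critic's score
total = loss_attn.mean() - critic(attn).mean()
opt.zero_grad(); total.backward(); opt.step()
```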
Pages: 9636-9645
Citations: 115
Gaussian Affinity for Max-Margin Class Imbalanced Learning
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00657
Munawar Hayat, Salman Hameed Khan, Syed Waqas Zamir, Jianbing Shen, Ling Shao
Real-world object classes appear in imbalanced ratios. This poses a significant challenge for classifiers, which become biased towards frequent classes. We hypothesize that improving the generalization capability of a classifier should improve learning on imbalanced datasets. Here, we introduce the first hybrid loss function that jointly performs classification and clustering in a single formulation. Our approach is based on an 'affinity measure' in Euclidean space that leads to the following benefits: (1) direct enforcement of maximum margin constraints on classification boundaries, (2) a tractable way to ensure uniformly spaced and equidistant cluster centers, (3) flexibility to learn multiple class prototypes to support diversity and discriminability in feature space. Our extensive experiments demonstrate significant performance improvements on visual classification and verification tasks on multiple imbalanced datasets. The proposed loss can easily be plugged into any deep architecture as a differentiable block and demonstrates robustness against different levels of data imbalance and corrupted labels.
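To illustrate what a Gaussian affinity measure with a max-margin term and a clustering term can look like, here is a minimal PyTorch sketch; the single-prototype-per-class setup, the sigma and margin values, and the hinge formulation are assumptions rather than the paper's released loss.

```python
# Minimal sketch (an interpretation, not the released code): class prototypes live in
# Euclidean feature space, similarity is exp(-||f - w||^2 / sigma), and a margin is
# pushed between the true-class affinity and all other affinities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianAffinityLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, sigma=1.0, margin=0.5):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.sigma, self.margin = sigma, margin

    def forward(self, feats, labels):               # feats: (B, D), labels: (B,)
        # squared Euclidean distance to every class prototype: (B, K)
        dists = torch.cdist(feats, self.prototypes).pow(2)
        affinity = torch.exp(-dists / self.sigma)   # Gaussian affinity in (0, 1]
        pos = affinity.gather(1, labels.unsqueeze(1))        # affinity to the true class
        # hinge-style max-margin term: true-class affinity should beat others by `margin`
        hinge = F.relu(self.margin + affinity - pos)
        mask = F.one_hot(labels, affinity.size(1)).bool()
        hinge = hinge.masked_fill(mask, 0.0)        # ignore the true-class column
        cluster = (1.0 - pos).mean()                # pull features toward their own prototype
        return hinge.sum(dim=1).mean() + cluster

# toy usage
loss_fn = GaussianAffinityLoss(num_classes=10, feat_dim=128)
feats = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
print(loss_fn(feats, labels))
```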
Pages: 6468-6478
Citations: 54
Generative Adversarial Training for Weakly Supervised Cloud Matting
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00029
Zhengxia Zou, Wenyuan Li, Tianyang Shi, Zhenwei Shi, Jieping Ye
The detection and removal of clouds in remote sensing images are essential for earth observation applications. Most previous methods treat cloud detection as a pixel-wise semantic segmentation process (cloud vs. background), which inevitably leads to a category-ambiguity problem when dealing with semi-transparent clouds. We re-examine cloud detection from a totally different point of view, i.e., we formulate it as a mixed energy separation process between foreground and background images, which can be equivalently implemented under an image matting paradigm with clear physical significance. We further propose a generative adversarial framework in which training our model requires neither any pixel-wise ground truth reference nor any additional user interactions. Our model consists of three networks, a cloud generator G, a cloud discriminator D, and a cloud matting network F, where G and D aim to generate realistic and physically meaningful cloud images by adversarial training, and F learns to predict the cloud reflectance and attenuation. Experimental results on a global set of satellite images demonstrate that our method, without ever using any pixel-wise ground truth during training, achieves comparable and even higher accuracy than other fully supervised methods, including some recent popular cloud detectors and some well-known semantic segmentation frameworks.
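A compact sketch of the adversarial matting idea follows, under an assumed imaging model (cloudy = attenuation * background + reflectance) and placeholder networks; the paper additionally trains a cloud generator G, omitted here for brevity, so this is an interpretation rather than the authors' code.

```python
# Minimal sketch: a matting network recovers reflectance and attenuation from a cloudy
# image, and a discriminator judges whether re-composited images look like real ones.
import torch
import torch.nn as nn

def composite(background, reflectance, attenuation):
    # assumed imaging model: cloudy = attenuation * background + reflectance
    return attenuation * background + reflectance

class MattingNet(nn.Module):                # predicts reflectance and attenuation maps
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 4, 3, padding=1))
    def forward(self, cloudy):
        out = torch.sigmoid(self.body(cloudy))
        return out[:, :3], out[:, 3:]       # reflectance (3 ch), attenuation (1 ch)

discriminator = nn.Sequential(nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
matting_net = MattingNet()
bce = nn.BCEWithLogitsLoss()
opt_f = torch.optim.Adam(matting_net.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_cloudy = torch.rand(2, 3, 64, 64)      # toy "real" cloudy patches
background = torch.rand(2, 3, 64, 64)       # toy clear background patches

# decompose a cloudy image, then re-composite onto a clear background to get a "fake"
refl, atten = matting_net(real_cloudy)
fake_cloudy = composite(background, refl, atten)

# discriminator step: real cloudy images vs. re-composited ones
d_loss = bce(discriminator(real_cloudy), torch.ones(2, 1)) + \
         bce(discriminator(fake_cloudy.detach()), torch.zeros(2, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# matting-network step: fool the discriminator with the re-composited image
f_loss = bce(discriminator(fake_cloudy), torch.ones(2, 1))
opt_f.zero_grad(); f_loss.backward(); opt_f.step()
```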
Pages: 201-210
Citations: 22
2019 Area Chairs
Pub Date : 2019-10-01 DOI: 10.1109/iccv.2019.00007
Citations: 0
ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00587
Yan Huang, Liang Wang
Image and sentence matching has drawn much attention recently, but due to the lack of sufficient pairwise data for training, most previous methods still cannot reliably associate challenging pairs of images and sentences that contain rarely appearing regions and words, i.e., few-shot content. In this work, we study this challenging scenario as few-shot image and sentence matching, and accordingly propose an Aligned Cross-Modal Memory (ACMM) model to memorize the rarely appearing content. Given a pair of image and sentence, the model first includes an aligned memory controller network to produce two sets of semantically-comparable interface vectors through cross-modal alignment. Then the interface vectors are used by modality-specific read and update operations to alternately interact with shared memory items. The memory items persistently memorize cross-modal shared semantic representations, which can be addressed to better enhance the representation of few-shot content. We apply the proposed model to both conventional and few-shot image and sentence matching tasks, and demonstrate its effectiveness by achieving state-of-the-art performance on two benchmark datasets.
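A minimal sketch of an attention-addressed shared memory that two modalities read from and softly update is given below; the item count, dimensions, and update rule are illustrative assumptions, not the paper's memory controller design.

```python
# Minimal sketch: both modalities address the same memory items by similarity,
# read a weighted sum, and nudge the most-addressed items toward their own vectors.
import torch
import torch.nn.functional as F

num_items, dim = 32, 256
memory = torch.randn(num_items, dim)                 # shared cross-modal memory items

def read(memory, query):
    # address memory by similarity and return a weighted sum of the items
    weights = F.softmax(query @ memory.t(), dim=-1)  # (B, num_items)
    return weights @ memory                          # (B, dim)

def update(memory, interface, rate=0.1):
    # soft write: each item is nudged toward the interface vectors that address it
    weights = F.softmax(interface @ memory.t(), dim=-1)       # (B, num_items)
    w = weights.sum(dim=0, keepdim=True).t()                  # (num_items, 1) total address weight
    target = (weights.t() @ interface) / (w + 1e-6)           # weighted mean of addressing vectors
    return memory + rate * w.clamp(max=1.0) * (target - memory)

image_vec = torch.randn(4, dim)      # image-side interface vectors (from the controller)
text_vec = torch.randn(4, dim)       # sentence-side interface vectors

img_read = read(memory, image_vec)   # memory-enhanced image representation
txt_read = read(memory, text_vec)    # memory-enhanced sentence representation
memory = update(memory, image_vec)   # modality-specific updates to the shared items
memory = update(memory, text_vec)
```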
Pages: 5773-5782
Citations: 53
Deep Contextual Attention for Human-Object Interaction Detection
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00579
Tiancai Wang, R. Anwer, M. H. Khan, F. Khan, Yanwei Pang, Ling Shao, Jorma T. Laaksonen
Human-object interaction detection is an important and relatively new class of visual relationship detection tasks, essential for deeper scene understanding. Most existing approaches decompose the problem into object localization and interaction recognition. Despite showing progress, these approaches only rely on the appearances of humans and objects and overlook the available context information, crucial for capturing subtle interactions between them. We propose a contextual attention framework for human-object interaction detection. Our approach leverages context by learning contextually-aware appearance features for human and object instances. The proposed attention module then adaptively selects relevant instance-centric context information to highlight image regions likely to contain human-object interactions. Experiments are performed on three benchmarks: V-COCO, HICO-DET and HCVRD. Our approach outperforms the state-of-the-art on all datasets. On the V-COCO dataset, our method achieves a relative gain of 4.4% in terms of role mean average precision (mAP_role), compared to the existing best approach.
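A minimal sketch of instance-centric context attention under assumed feature shapes follows: an instance (e.g., human) feature queries the spatial context map and the attended context is fused back into the instance representation; this is an illustration, not the paper's module.

```python
# Minimal sketch: an instance feature acts as a query over the spatial context map,
# and the attended context is concatenated and fused back into the instance feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceContextAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Conv2d(dim, dim, 1)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, inst_feat, context):            # inst_feat: (B, D), context: (B, D, H, W)
        B, D, H, W = context.shape
        q = self.query(inst_feat)                      # (B, D)
        k = self.key(context).flatten(2)               # (B, D, H*W)
        attn = F.softmax((q.unsqueeze(1) @ k).squeeze(1) / D ** 0.5, dim=-1)  # (B, H*W)
        ctx = (context.flatten(2) * attn.unsqueeze(1)).sum(-1)                # (B, D) attended context
        return self.fuse(torch.cat([inst_feat, ctx], dim=-1))                 # contextual instance feature

# toy usage: one "human" instance attending over the image's context features
module = InstanceContextAttention(dim=256)
human_feat = torch.randn(2, 256)
context_map = torch.randn(2, 256, 16, 16)
print(module(human_feat, context_map).shape)           # torch.Size([2, 256])
```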
Pages: 5693-5701
Citations: 103
Enhancing Low Light Videos by Exploring High Sensitivity Camera Noise
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00421
Wei Wang, Xin Chen, Cheng Yang, Xiang Li, Xue-mei Hu, Tao Yue
Enhancing low light videos, which consists of denoising and brightness adjustment, is an intriguing but knotty problem. Under low light conditions, due to high sensitivity camera settings, commonly negligible noise becomes obvious and severely deteriorates the captured videos. To recover high quality videos, a mass of image/video denoising and enhancement algorithms have been proposed, most of which follow a set of simple assumptions about the statistical characteristics of camera noise, e.g., independent and identically distributed (i.i.d.), white, additive, Gaussian, Poisson, or mixture noise. However, the practical noise under high sensitivity settings in real captured videos is complex and poorly modeled by these assumptions. In this paper, we explore the physical origins of the practical high sensitivity noise in digital cameras, model them mathematically, and propose to enhance low light videos based on the noise model by using an LSTM-based neural network. Specifically, we generate the training data with the proposed noise model and train the network with the dark noisy video as input and the clear, bright video as output. Extensive comparisons with state-of-the-art methods on both synthetic and real captured low light videos demonstrate the effectiveness of the proposed method.
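To make the data-synthesis step concrete, here is a minimal NumPy sketch that produces (noisy, clean) training pairs from a simplified shot-plus-read-noise model with quantization; the paper's high sensitivity noise model contains additional physical components, so both the model and its parameters here are assumptions.

```python
# Minimal sketch of synthesizing low light training pairs from a physically
# motivated noise model: Poisson shot noise + Gaussian read noise + quantization.
import numpy as np

def synthesize_low_light_pair(clean, photons_per_unit=30.0, read_std=2.0, bits=8):
    """clean: float array in [0, 1]; returns (noisy, clean) as a training pair."""
    rng = np.random.default_rng()
    electrons = clean * photons_per_unit                  # expected photo-electrons
    shot = rng.poisson(electrons).astype(np.float64)      # signal-dependent shot noise
    read = rng.normal(0.0, read_std, size=clean.shape)    # sensor read noise
    signal = (shot + read) / photons_per_unit             # back to a [0, 1]-ish range
    levels = 2 ** bits - 1
    noisy = np.round(np.clip(signal, 0.0, 1.0) * levels) / levels   # ADC quantization
    return noisy.astype(np.float32), clean.astype(np.float32)

# toy usage: a dim, clean frame sequence becomes (noisy input, clean target) pairs
clean_video = np.random.rand(8, 64, 64, 3) * 0.2
pairs = [synthesize_low_light_pair(frame) for frame in clean_video]
```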
Pages: 4110-4118
Citations: 39
Self-Supervised Moving Vehicle Tracking With Stereo Sound
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00715
Chuang Gan, Hang Zhao, Peihao Chen, David D. Cox, A. Torralba
Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground truth annotations. In particular, we propose a framework that consists of a vision "teacher" network and a stereo-sound "student" network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera metadata, without any visual input. Experimental results on a newly collected Auditory Vehicles Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions.
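A minimal sketch of the cross-modal teacher-student transfer is shown below with toy networks: a frozen vision model provides pseudo localization targets from frames, and an audio network learns to reproduce them from stereo spectrograms; the network shapes and the smooth-L1 objective are assumptions, not the released system.

```python
# Minimal sketch: distill localization knowledge from a frozen vision "teacher"
# into an audio "student" that sees only two-channel (stereo) spectrograms.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
teacher.eval()                                   # stands in for a pretrained vehicle detector
for p in teacher.parameters():
    p.requires_grad_(False)

student = nn.Sequential(nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

frames = torch.rand(8, 3, 128, 128)              # unlabeled video frames
spectrograms = torch.rand(8, 2, 128, 64)         # paired stereo (2-channel) spectrograms

with torch.no_grad():
    target_boxes = teacher(frames)               # pseudo ground truth from vision

pred_boxes = student(spectrograms)               # localization from sound alone
loss = nn.functional.smooth_l1_loss(pred_boxes, target_boxes)
opt.zero_grad(); loss.backward(); opt.step()
# at test time only student(spectrograms) is needed, with no visual input
```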
Pages: 7052-7061
Citations: 124
SO-HandNet: Self-Organizing Network for 3D Hand Pose Estimation With Semi-Supervised Learning
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00706
Yujin Chen, Zhigang Tu, Liuhao Ge, Dejun Zhang, Ruizhi Chen, Junsong Yuan
3D hand pose estimation has made significant progress recently, where Convolutional Neural Networks (CNNs) play a critical role. However, most of the existing CNN-based hand pose estimation methods depend heavily on the training set, while labeling 3D hand poses on training data is laborious and time-consuming. Inspired by the point cloud autoencoder presented in the self-organizing network (SO-Net), our proposed SO-HandNet aims at making use of unannotated data to obtain accurate 3D hand pose estimation in a semi-supervised manner. We exploit a hand feature encoder (HFE) to extract multi-level features from the hand point cloud and then fuse them to regress the 3D hand pose with a hand pose estimator (HPE). We design a hand feature decoder (HFD) to recover the input point cloud from the encoded feature. Since the HFE and the HFD can be trained without 3D hand pose annotation, the proposed method is able to make the most of unannotated data during the training phase. Experiments on four challenging benchmark datasets validate that our proposed SO-HandNet can achieve superior performance for 3D hand pose estimation via semi-supervised learning.
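A minimal sketch of the semi-supervised recipe with toy MLPs follows: unlabeled point clouds contribute only a reconstruction (Chamfer) loss through the encoder-decoder, while labeled clouds additionally supervise the pose estimator; the architecture and equal loss weighting are placeholders, not the SO-Net-based design.

```python
# Minimal sketch: encoder/decoder trained on all point clouds via reconstruction,
# pose head trained only where 3D joint labels exist.
import torch
import torch.nn as nn

N_PTS, N_JOINTS = 256, 21

class Encoder(nn.Module):                        # point cloud (B, N, 3) -> feature (B, 128)
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))
    def forward(self, pts):
        return self.mlp(pts).max(dim=1).values   # symmetric (max) pooling over points

encoder = Encoder()
decoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, N_PTS * 3))
pose_head = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, N_JOINTS * 3))

def chamfer(a, b):                               # simple symmetric Chamfer distance
    d = torch.cdist(a, b)                        # (B, N, N)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters(), *pose_head.parameters()], lr=1e-3)

labeled_pts = torch.randn(4, N_PTS, 3)           # toy labeled hand point clouds
labeled_joints = torch.randn(4, N_JOINTS, 3)
unlabeled_pts = torch.randn(4, N_PTS, 3)         # toy unannotated clouds

feat_l, feat_u = encoder(labeled_pts), encoder(unlabeled_pts)
recon_l = decoder(feat_l).view(-1, N_PTS, 3)
recon_u = decoder(feat_u).view(-1, N_PTS, 3)
pose_l = pose_head(feat_l).view(-1, N_JOINTS, 3)

loss = chamfer(recon_l, labeled_pts) + chamfer(recon_u, unlabeled_pts) \
     + nn.functional.mse_loss(pose_l, labeled_joints)    # pose loss only where labels exist
opt.zero_grad(); loss.backward(); opt.step()
```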
Pages: 6960-6969
Citations: 67
Hierarchical Self-Attention Network for Action Localization in Videos
Pub Date : 2019-10-01 DOI: 10.1109/ICCV.2019.00015
Rizard Renanda Adhi Pramono, Yie-Tarng Chen, Wen-Hsien Fang
This paper presents a novel Hierarchical Self-Attention Network (HISAN) to generate spatial-temporal tubes for action localization in videos. The essence of HISAN is to combine a two-stream convolutional neural network (CNN) with a hierarchical bidirectional self-attention mechanism, which comprises two levels of bidirectional self-attention to effectively capture both long-term temporal dependency information and spatial context information, rendering more precise action localization. Also, a sequence rescoring (SR) algorithm is employed to resolve the dilemma of inconsistent detection scores incurred by occlusion or background clutter. Moreover, a new fusion scheme is invoked, which integrates not only the appearance and motion information from the two-stream network, but also the motion saliency, to mitigate the effect of camera motion. Simulations reveal that the new approach achieves performance competitive with state-of-the-art works in terms of action localization and recognition accuracy on the widely used UCF101-24 and J-HMDB datasets.
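As a small illustration of the core operation the network builds on, the sketch below applies self-attention across per-frame clip features so every frame can use temporal context from all others; the clip length, feature size, and residual fusion are assumptions rather than the paper's exact two-level hierarchy.

```python
# Minimal sketch: temporal self-attention over per-frame features of a clip.
import torch
import torch.nn as nn

frames, dim = 16, 256                             # clip length and feature size (placeholders)
self_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

clip_feats = torch.randn(2, frames, dim)          # per-frame features from a two-stream CNN
context, weights = self_attn(clip_feats, clip_feats, clip_feats)   # (B, frames, dim)

# hierarchical use: a second attention layer could be stacked on these outputs, and the
# attended features fed to per-frame action classifiers / bounding-box regressors
enhanced = clip_feats + context                   # residual fusion of temporal context
print(enhanced.shape)                             # torch.Size([2, 16, 256])
```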
Pages: 61-70
Citations: 30