
Latest publications from the 2021 IEEE/CVF International Conference on Computer Vision (ICCV)

Viewpoint-Agnostic Change Captioning with Cycle Consistency
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.00210
Hoeseong Kim, Jongseok Kim, Hyungseok Lee, Hyun-a Park, Gunhee Kim
Change captioning is the task of identifying the change and describing it with a concise caption. Despite recent advancements, filtering out insignificant changes still remains a challenge. Namely, images from different camera perspectives can cause issues; a mere change in viewpoint should be disregarded while still capturing the actual changes. In order to tackle this problem, we present a new Viewpoint-Agnostic change captioning network with Cycle Consistency (VACC) that requires only one image each for the before and after scene, without depending on any other information. We achieve this by devising a new difference encoder module which can encode viewpoint information and model the difference more effectively. In addition, we propose a cycle consistency module that can potentially improve the performance of any change captioning network in general by matching the composite feature of the generated caption and before image with the after image feature. We evaluate the performance of our proposed model across three datasets for change captioning, including a novel dataset we introduce here that contains images with changes under extreme viewpoint shifts. Through our experiments, we show the excellence of our method with respect to the CIDEr, BLEU-4, METEOR and SPICE scores. Moreover, we demonstrate that attaching our proposed cycle consistency module yields a performance boost for existing change captioning networks, even with varying image encoding mechanisms.
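The cycle consistency module described above composes the generated caption with the "before" image feature and matches the result against the "after" image feature. A minimal PyTorch sketch of that idea follows; the fusion MLP, feature dimensions, and cosine-distance loss are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CycleConsistency(nn.Module):
    """Illustrative cycle-consistency loss: fuse the caption embedding with the
    'before' image feature and push the composite toward the 'after' feature."""
    def __init__(self, img_dim=2048, txt_dim=512, hidden=1024):
        super().__init__()
        self.compose = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, img_dim),
        )

    def forward(self, before_feat, caption_feat, after_feat):
        composite = self.compose(torch.cat([before_feat, caption_feat], dim=-1))
        # 1 - cosine similarity as a simple matching loss (assumption)
        return (1.0 - F.cosine_similarity(composite, after_feat, dim=-1)).mean()

# usage with random features
loss_fn = CycleConsistency()
loss = loss_fn(torch.randn(4, 2048), torch.randn(4, 512), torch.randn(4, 2048))
loss.backward()
```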
{"title":"Viewpoint-Agnostic Change Captioning with Cycle Consistency","authors":"Hoeseong Kim, Jongseok Kim, Hyungseok Lee, Hyun-a Park, Gunhee Kim","doi":"10.1109/ICCV48922.2021.00210","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.00210","url":null,"abstract":"Change captioning is the task of identifying the change and describing it with a concise caption. Despite recent advancements, filtering out insignificant changes still remains as a challenge. Namely, images from different camera perspectives can cause issues; a mere change in viewpoint should be disregarded while still capturing the actual changes. In order to tackle this problem, we present a new Viewpoint-Agnostic change captioning network with Cycle Consistency (VACC) that requires only one image each for the before and after scene, without depending on any other information. We achieve this by devising a new difference encoder module which can encode viewpoint information and model the difference more effectively. In addition, we propose a cycle consistency module that can potentially improve the performance of any change captioning networks in general by matching the composite feature of the generated caption and before image with the after image feature. We evaluate the performance of our proposed model across three datasets for change captioning, including a novel dataset we introduce here that contains images with changes under extreme viewpoint shifts. Through our experiments, we show the excellence of our method with respect to the CIDEr, BLEU-4, METEOR and SPICE scores. Moreover, we demonstrate that attaching our proposed cycle consistency module yields a performance boost for existing change captioning networks, even with varying image encoding mechanisms.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"48 1","pages":"2075-2084"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79757808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person Pose Estimation
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01164
Guillem Brasó, Nikita Kister, L. Leal-Taixé
We introduce CenterGroup, an attention-based framework to estimate human poses from a set of identity-agnostic keypoints and person center predictions in an image. Our approach uses a transformer to obtain context-aware embeddings for all detected keypoints and centers and then applies multi-head attention to directly group joints into their corresponding person centers. While most bottom-up methods rely on non-learnable clustering at inference, CenterGroup uses a fully differentiable attention mechanism that we train end-to-end together with our keypoint detector. As a result, our method obtains state-of-the-art performance with up to 2.5x faster inference time than competing bottom-up approaches. Our code is available at https://github.com/dvl-tum/center-group
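A minimal sketch of the attention-based grouping described above: keypoint embeddings attend to person-center embeddings, and each keypoint is assigned to the center receiving the highest attention weight. The single attention layer, dimensions, and argmax assignment are illustrative assumptions rather than the CenterGroup architecture.

```python
import torch
import torch.nn as nn

class AttentionGrouping(nn.Module):
    """Illustrative grouping: keypoints (queries) attend to person centers (keys/values)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, keypoint_emb, center_emb):
        # keypoint_emb: (B, K, D), center_emb: (B, C, D)
        out, weights = self.attn(keypoint_emb, center_emb, center_emb, need_weights=True)
        assignment = weights.argmax(dim=-1)   # (B, K): chosen center index per keypoint
        return out, assignment

grouper = AttentionGrouping()
_, assign = grouper(torch.randn(2, 17, 256), torch.randn(2, 3, 256))
print(assign.shape)  # torch.Size([2, 17])
```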
{"title":"The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person Pose Estimation","authors":"Guillem Bras'o, Nikita Kister, L. Leal-Taix'e","doi":"10.1109/ICCV48922.2021.01164","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.01164","url":null,"abstract":"We introduce CenterGroup, an attention-based framework to estimate human poses from a set of identity-agnostic keypoints and person center predictions in an image. Our approach uses a transformer to obtain context-aware embeddings for all detected keypoints and centers and then applies multi-head attention to directly group joints into their corresponding person centers. While most bottom-up methods rely on non-learnable clustering at inference, CenterGroup uses a fully differentiable attention mechanism that we train end-to-end together with our keypoint detector. As a result, our method obtains state-of-the-art performance with up to 2.5x faster inference time than competing bottom-up approaches. Our code is available at https://github.com/dvl-tum/center-group","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"1 1","pages":"11833-11843"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80414275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Structure-transformed Texture-enhanced Network for Person Image Synthesis
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01360
Munan Xu, Yuanqi Chen, Sha Liu, Thomas H. Li, Gezhong Li
The pose-guided virtual try-on task aims to modify a fashion item on a person, building on the pose transfer task. These two tasks both belong to person image synthesis and have strong correlations and similarities. However, existing methods treat them as two individual tasks and do not explore the correlations between them. Moreover, both tasks are challenging due to large misalignment and occlusions, so most of these methods are prone to generating unclear human body structure and blurry fine-grained textures. In this paper, we devise a structure-transformed texture-enhanced network to generate high-quality person images and to model the relationships between the two tasks. It consists of two modules: a structure-transformed renderer and a texture-enhanced stylizer. The structure-transformed renderer transforms the source person structure to the target one, while the texture-enhanced stylizer enhances detailed textures and controllably injects the fashion style based on the structural transformation. With the two modules, our model can generate photorealistic person images in diverse poses and even with various fashion styles. Extensive experiments demonstrate that our approach achieves state-of-the-art results on both tasks.
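The two-stage decomposition can be summarized in a short sketch: a renderer maps the source person to the target pose, and a stylizer injects garment texture on top of the transformed structure. The module interfaces, channel counts, and residual refinement below are assumptions, not the paper's networks.

```python
import torch
import torch.nn as nn

class Renderer(nn.Module):
    """Illustrative structure-transformed renderer: source image + poses -> coarse target-pose image."""
    def __init__(self):
        super().__init__()
        # 3 (image) + 18 + 18 (source/target pose heatmaps) input channels, assumed
        self.net = nn.Sequential(nn.Conv2d(3 + 18 + 18, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, src_img, src_pose, tgt_pose):
        return self.net(torch.cat([src_img, src_pose, tgt_pose], dim=1))

class Stylizer(nn.Module):
    """Illustrative texture-enhanced stylizer: coarse image + garment image -> refined output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, coarse, garment):
        # residual refinement of the coarse rendering (assumption)
        return coarse + self.net(torch.cat([coarse, garment], dim=1))

renderer, stylizer = Renderer(), Stylizer()
coarse = renderer(torch.randn(1, 3, 256, 256), torch.randn(1, 18, 256, 256), torch.randn(1, 18, 256, 256))
final = stylizer(coarse, torch.randn(1, 3, 256, 256))
```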
{"title":"Structure-transformed Texture-enhanced Network for Person Image Synthesis","authors":"Munan Xu, Yuanqi Chen, Sha Liu, Thomas H. Li, Gezhong Li","doi":"10.1109/ICCV48922.2021.01360","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.01360","url":null,"abstract":"Pose-guided virtual try-on task aims to modify the fashion item based on pose transfer task. These two tasks that belong to person image synthesis have strong correlations and similarities. However, existing methods treat them as two individual tasks and do not explore correlations between them. Moreover, these two tasks are challenging due to large misalignment and occlusions, thus most of these methods are prone to generate unclear human body structure and blurry fine-grained textures. In this paper, we devise a structure-transformed texture-enhanced network to generate high-quality person images and construct the relationships between two tasks. It consists of two modules: structure-transformed renderer and texture-enhanced stylizer. The structure-transformed renderer is introduced to transform the source person structure to the target one, while the texture-enhanced stylizer is served to enhance detailed textures and controllably inject the fashion style founded on the structural transformation. With the two modules, our model can generate photorealistic person images in diverse poses and even with various fashion styles. Extensive experiments demonstrate that our approach achieves state-of-the-art results on two tasks.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"101 1","pages":"13839-13848"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80543400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Kernel Methods in Hyperbolic Spaces
Pub Date : 2021-10-01 DOI: 10.1109/iccv48922.2021.01049
Pengfei Fang, Mehrtash Harandi, L. Petersson
Embedding data in hyperbolic spaces has proven beneficial for many advanced machine learning applications such as image classification and word embeddings. However, working in hyperbolic spaces is not without difficulties as a result of its curved geometry (e.g., computing the Frechet mean of a set of points requires an iterative algorithm). Furthermore, in Euclidean spaces, one can resort to kernel machines that not only enjoy rich theoretical properties but that can also lead to superior representational power (e.g., infinite-width neural networks). In this paper, we introduce positive definite kernel functions for hyperbolic spaces. This brings in two major advantages, 1. kernelization will pave the way to seamlessly benefit from kernel machines in conjunction with hyperbolic embeddings, and 2. the rich structure of the Hilbert spaces associated with kernel machines enables us to simplify various operations involving hyperbolic data. That said, identifying valid kernel functions on curved spaces is not straightforward and is indeed considered an open problem in the learning community. Our work addresses this gap and develops several valid positive definite kernels in hyperbolic spaces, including the universal ones (e.g., RBF). We comprehensively study the proposed kernels on a variety of challenging tasks including few-shot learning, zero-shot learning, person reidentification and knowledge distillation, showing the superiority of the kernelization for hyperbolic representations.
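For orientation, the sketch below computes the geodesic distance in the Poincaré ball and builds a Gram matrix from a radial function of that distance. An arbitrary radial function of hyperbolic distance is not automatically positive definite — establishing valid kernels is precisely the paper's contribution — so the kernel choice here is purely illustrative.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between points x, y inside the unit Poincare ball."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / (denom + eps))

def gram_matrix(points, lam=1.0):
    """Radial 'kernel' K[i, j] = exp(-lam * d_H(x_i, x_j)); positive definiteness
    is NOT guaranteed for arbitrary choices -- see the paper for valid constructions."""
    n = len(points)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-lam * poincare_distance(points[i], points[j]))
    return K

pts = [0.3 * np.random.randn(2) for _ in range(5)]
pts = [p / (1.0 + np.linalg.norm(p)) for p in pts]   # keep points strictly inside the ball
print(gram_matrix(pts).shape)  # (5, 5)
```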
{"title":"Kernel Methods in Hyperbolic Spaces","authors":"Pengfei Fang, Mehrtash Harandi, L. Petersson","doi":"10.1109/iccv48922.2021.01049","DOIUrl":"https://doi.org/10.1109/iccv48922.2021.01049","url":null,"abstract":"Embedding data in hyperbolic spaces has proven beneficial for many advanced machine learning applications such as image classification and word embeddings. However, working in hyperbolic spaces is not without difficulties as a result of its curved geometry (e.g., computing the Frechet mean of a set of points requires an iterative algorithm). Furthermore, in Euclidean spaces, one can resort to kernel machines that not only enjoy rich theoretical properties but that can also lead to superior representational power (e.g., infinite-width neural networks). In this paper, we introduce positive definite kernel functions for hyperbolic spaces. This brings in two major advantages, 1. kernelization will pave the way to seamlessly benefit from kernel machines in conjunction with hyperbolic embeddings, and 2. the rich structure of the Hilbert spaces associated with kernel machines enables us to simplify various operations involving hyperbolic data. That said, identifying valid kernel functions on curved spaces is not straightforward and is indeed considered an open problem in the learning community. Our work addresses this gap and develops several valid positive definite kernels in hyperbolic spaces, including the universal ones (e.g., RBF). We comprehensively study the proposed kernels on a variety of challenging tasks including few-shot learning, zero-shot learning, person reidentification and knowledge distillation, showing the superiority of the kernelization for hyperbolic representations.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"1 1","pages":"10645-10654"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80919473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Efficient Action Recognition via Dynamic Knowledge Propagation
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01346
Hanul Kim, Mihir Jain, Jun-Tae Lee, Sungrack Yun, F. Porikli
Efficient action recognition has become crucial to extend the success of action recognition to many real-world applications. Contrary to most existing methods, which mainly focus on selecting salient frames to reduce the computation cost, we focus more on making the most of the selected frames. To this end, we employ two networks of different capabilities that operate in tandem to efficiently recognize actions. Given a video, the lighter network processes more frames while the heavier one only processes a few. In order to enable the effective interaction between the two, we propose dynamic knowledge propagation based on a cross-attention mechanism. This is the main component of our framework that is essentially a student-teacher architecture, but as the teacher model continues to interact with the student model during inference, we call it a dynamic student-teacher framework. Through extensive experiments, we demonstrate the effectiveness of each component of our framework. Our method outperforms competing state-of-the-art methods on two video datasets: ActivityNet-v1.3 and Mini-Kinetics.
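The cross-attention-based propagation can be sketched as the light (student) network's frame features querying the heavy (teacher) network's features; the residual update, layer normalization, and dimensions below are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class KnowledgePropagation(nn.Module):
    """Illustrative cross-attention: student frame features (queries) attend to
    teacher frame features (keys/values) and are refined with a residual update."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, student_feats, teacher_feats):
        # student_feats: (B, T_s, D) from many cheap frames; teacher_feats: (B, T_t, D) from a few frames
        propagated, _ = self.cross_attn(student_feats, teacher_feats, teacher_feats)
        return self.norm(student_feats + propagated)

kp = KnowledgePropagation()
out = kp(torch.randn(2, 16, 512), torch.randn(2, 4, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```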
{"title":"Efficient Action Recognition via Dynamic Knowledge Propagation","authors":"Hanul Kim, Mihir Jain, Jun-Tae Lee, Sungrack Yun, F. Porikli","doi":"10.1109/ICCV48922.2021.01346","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.01346","url":null,"abstract":"Efficient action recognition has become crucial to extend the success of action recognition to many real-world applications. Contrary to most existing methods, which mainly focus on selecting salient frames to reduce the computation cost, we focus more on making the most of the selected frames. To this end, we employ two networks of different capabilities that operate in tandem to efficiently recognize actions. Given a video, the lighter network processes more frames while the heavier one only processes a few. In order to enable the effective interaction between the two, we propose dynamic knowledge propagation based on a cross-attention mechanism. This is the main component of our framework that is essentially a student-teacher architecture, but as the teacher model continues to interact with the student model during inference, we call it a dynamic student-teacher framework. Through extensive experiments, we demonstrate the effectiveness of each component of our framework. Our method outperforms competing state-of-the-art methods on two video datasets: ActivityNet-v1.3 and Mini-Kinetics.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"42 2 1","pages":"13699-13708"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82859374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Weakly-Supervised Action Segmentation and Alignment via Transcript-Aware Union-of-Subspaces Learning
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.00798
Zijia Lu, Ehsan Elhamifar
We address the problem of learning to segment actions from weakly-annotated videos, i.e., videos accompanied by transcripts (ordered lists of actions). We propose a framework in which we model actions with a union of low-dimensional subspaces, learn the subspaces using transcripts, and refine video features so that they lend themselves to action subspaces. To do so, we design an architecture consisting of a Union-of-Subspaces Network, an ensemble of autoencoders, each of which models a low-dimensional action subspace and can capture variations of an action within and across videos. For learning, at each iteration we generate positive and negative soft alignment matrices using the segmentations from the previous iteration, which we use for discriminative training of our model. To regularize the learning, we introduce a constraint loss that prevents imbalanced segmentations and enforces relatively similar durations of each action across videos. To enable real-time inference, we develop a hierarchical segmentation framework that uses subset selection to find representative transcripts and hierarchically aligns a test video with increasingly refined representative transcripts. Our experiments on three datasets show that our method improves the state-of-the-art action segmentation and alignment while speeding up inference by a factor of 4 to 13.
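The union-of-subspaces idea — one small autoencoder per action, with frames scored by reconstruction error — admits a compact sketch; the linear autoencoders and the soft assignment via a softmax over negative errors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnionOfSubspaces(nn.Module):
    """Illustrative ensemble: one linear autoencoder per action models a low-dimensional subspace;
    frames are softly assigned to actions by reconstruction error."""
    def __init__(self, feat_dim=512, sub_dim=32, num_actions=10):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(feat_dim, sub_dim) for _ in range(num_actions)])
        self.decoders = nn.ModuleList([nn.Linear(sub_dim, feat_dim) for _ in range(num_actions)])

    def forward(self, frames):                      # frames: (T, feat_dim)
        errors = []
        for enc, dec in zip(self.encoders, self.decoders):
            recon = dec(enc(frames))
            errors.append(((frames - recon) ** 2).mean(dim=-1))
        errors = torch.stack(errors, dim=-1)        # (T, num_actions)
        return torch.softmax(-errors, dim=-1)       # soft frame-to-action assignment

model = UnionOfSubspaces()
print(model(torch.randn(100, 512)).shape)  # torch.Size([100, 10])
```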
{"title":"Weakly-Supervised Action Segmentation and Alignment via Transcript-Aware Union-of-Subspaces Learning","authors":"Zijia Lu, Ehsan Elhamifar","doi":"10.1109/ICCV48922.2021.00798","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.00798","url":null,"abstract":"We address the problem of learning to segment actions from weakly-annotated videos, i.e., videos accompanied by transcripts (ordered list of actions). We propose a framework in which we model actions with a union of low-dimensional subspaces, learn the subspaces using transcripts and refine video features that lend themselves to action subspaces. To do so, we design an architecture consisting of a Union-of-Subspaces Network, which is an ensemble of autoencoders, each modeling a low-dimensional action subspace and can capture variations of an action within and across videos. For learning, at each iteration, we generate positive and negative soft alignment matrices using the segmentations from the previous iteration, which we use for discriminative training of our model. To regularize the learning, we introduce a constraint loss that prevents imbalanced segmentations and enforces relatively similar duration of each action across videos. To have a real-time inference, we develop a hierarchical segmentation framework that uses subset selection to find representative transcripts and hierarchically align a test video with increasingly refined representative transcripts. Our experiments on three datasets show that our method improves the state-of-the-art action segmentation and alignment, while speeding up the inference time by a factor of 4 to 13. 1","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"31 1","pages":"8065-8075"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80367779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
PIAP-DF: Pixel-Interested and Anti Person-Specific Facial Action Unit Detection Net with Discrete Feedback Learning
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.01266
Yang Tang, Wangding Zeng, Dafei Zhao, Honggang Zhang
Facial Action Units (AUs) are of great significance in communication. Automatic AU detection can improve the understanding of psychological conditions and emotional status. Recently, several deep learning methods have been proposed to detect AUs automatically. However, several challenges, such as poor extraction of fine-grained and robust local AUs information, model overfitting on person-specific features, as well as the limitation of datasets with wrong labels, remain to be addressed. In this paper, we propose a joint strategy called PIAP-DF to solve these problems, which involves 1) a multi-stage Pixel-Interested learning method with pixel-level attention for each AU; 2) an Anti Person-Specific method aiming to eliminate features associated with any individual as much as possible; 3) a semi-supervised learning method with Discrete Feedback, designed to effectively utilize unlabeled data and mitigate the negative impacts of wrong labels. Experimental results on the two popular AU detection datasets BP4D and DISFA prove that PIAP-DF can be the new state-of-the-art method. Compared with the current best method, PIAP-DF improves the average F1 score by 3.2% on BP4D and by 0.5% on DISFA. All modules of PIAP-DF can be easily removed after training to obtain a lightweight model for practical application.
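The pixel-interested component can be sketched as a per-AU spatial attention map that pools backbone features before a per-AU classifier; the feature shape, sigmoid attention, and linear classifiers below are assumptions, not the PIAP-DF architecture.

```python
import torch
import torch.nn as nn

class PixelInterestedAUHead(nn.Module):
    """Illustrative per-AU pixel attention: each AU gets a spatial attention map
    used to pool backbone features, followed by a per-AU binary classifier."""
    def __init__(self, channels=256, num_aus=12):
        super().__init__()
        self.attn = nn.Conv2d(channels, num_aus, kernel_size=1)   # one attention map per AU
        self.classifiers = nn.ModuleList([nn.Linear(channels, 1) for _ in range(num_aus)])

    def forward(self, feats):                          # feats: (B, C, H, W) backbone features
        maps = torch.sigmoid(self.attn(feats))         # (B, A, H, W) pixel-level attention
        logits = []
        for a, clf in enumerate(self.classifiers):
            w = maps[:, a:a + 1]                       # (B, 1, H, W)
            pooled = (feats * w).sum(dim=(2, 3)) / (w.sum(dim=(2, 3)) + 1e-6)
            logits.append(clf(pooled))
        return torch.cat(logits, dim=1)                # (B, A) per-AU logits

head = PixelInterestedAUHead()
print(head(torch.randn(2, 256, 28, 28)).shape)  # torch.Size([2, 12])
```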
{"title":"PIAP-DF: Pixel-Interested and Anti Person-Specific Facial Action Unit Detection Net with Discrete Feedback Learning","authors":"Yang Tang, Wangding Zeng, Dafei Zhao, Honggang Zhang","doi":"10.1109/ICCV48922.2021.01266","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.01266","url":null,"abstract":"Facial Action Units (AUs) are of great significance in communication. Automatic AU detection can improve the understanding of psychological conditions and emotional status. Recently, several deep learning methods have been proposed to detect AUs automatically. However, several challenges, such as poor extraction of fine-grained and robust local AUs information, model overfitting on person-specific features, as well as the limitation of datasets with wrong labels, remain to be addressed. In this paper, we propose a joint strategy called PIAP-DF to solve these problems, which involves 1) a multi-stage Pixel-Interested learning method with pixel-level attention for each AU; 2) an Anti Person-Specific method aiming to eliminate features associated with any individual as much as possible; 3) a semi-supervised learning method with Discrete Feedback, designed to effectively utilize unlabeled data and mitigate the negative impacts of wrong labels. Experimental results on the two popular AU detection datasets BP4D and DISFA prove that PIAP-DF can be the new state-of-the-art method. Compared with the current best method, PIAP-DF improves the average F1 score by 3.2% on BP4D and by 0.5% on DISFA. All modules of PIAP-DF can be easily removed after training to obtain a lightweight model for practical application.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"53 1","pages":"12879-12888"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81016357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
On Equivariant and Invariant Learning of Object Landmark Representations
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.00975
Zezhou Cheng, Jong-Chyi Su, Subhransu Maji
Given a collection of images, humans are able to discover landmarks by modeling the shared geometric structure across instances. This idea of geometric equivariance has been widely used for the unsupervised discovery of object landmark representations. In this paper, we develop a simple and effective approach by combining instance-discriminative and spatially-discriminative contrastive learning. We show that when a deep network is trained to be invariant to geometric and photometric transformations, representations emerge from its intermediate layers that are highly predictive of object landmarks. Stacking these across layers in a "hypercolumn" and projecting them using spatially-contrastive learning further improves their performance on matching and few-shot landmark regression tasks. We also present a unified view of existing equivariant and invariant representation learning approaches through the lens of contrastive learning, shedding light on the nature of invariances learned. Experiments on standard benchmarks for landmark learning, as well as a new challenging one we propose, show that the proposed approach surpasses prior state-of-the-art.
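The hypercolumn construction — intermediate feature maps upsampled to a common resolution and concatenated so that each pixel carries a multi-layer descriptor — can be illustrated with a toy CNN; the backbone and layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypercolumnNet(nn.Module):
    """Illustrative hypercolumn extraction from a toy CNN: intermediate feature maps are
    upsampled to the input resolution and stacked channel-wise per pixel."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.block3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        size = x.shape[-2:]
        feats, out = [], x
        for block in (self.block1, self.block2, self.block3):
            out = block(out)
            feats.append(F.interpolate(out, size=size, mode='bilinear', align_corners=False))
        return torch.cat(feats, dim=1)    # (B, 32+64+128, H, W): one hypercolumn per pixel

net = HypercolumnNet()
print(net(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 224, 64, 64])
```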
{"title":"On Equivariant and Invariant Learning of Object Landmark Representations","authors":"Zezhou Cheng, Jong-Chyi Su, Subhransu Maji","doi":"10.1109/ICCV48922.2021.00975","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.00975","url":null,"abstract":"Given a collection of images, humans are able to discover landmarks by modeling the shared geometric structure across instances. This idea of geometric equivariance has been widely used for the unsupervised discovery of object landmark representations. In this paper, we develop a simple and effective approach by combining instance-discriminative and spatially-discriminative contrastive learning. We show that when a deep network is trained to be invariant to geometric and photometric transformations, representations emerge from its intermediate layers that are highly predictive of object landmarks. Stacking these across layers in a \"hypercolumn\" and projecting them using spatially-contrastive learning further improves their performance on matching and few-shot landmark regression tasks. We also present a unified view of existing equivariant and invariant representation learning approaches through the lens of contrastive learning, shedding light on the nature of invariances learned. Experiments on standard benchmarks for landmark learning, as well as a new challenging one we propose, show that the proposed approach surpasses prior state-of-the-art.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"33 1","pages":"9877-9886"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81337686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Learning to Bundle-adjust: A Graph Network Approach to Faster Optimization of Bundle Adjustment for Vehicular SLAM
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.00619
Tetsuya Tanaka, Yukihiro Sasagawa, Takayuki Okatani
Bundle adjustment (BA) occupies a large portion of the execution time of SfM and visual SLAM. Local BA over the latest several keyframes plays a crucial role in visual SLAM. Its execution time should be sufficiently short for robust tracking; this is especially critical for embedded systems with a limited computational resource. This study proposes a learning-based bundle adjuster using a graph network. It works faster and can be used instead of conventional optimization-based BA. The graph network operates on a graph consisting of the nodes of keyframes and landmarks and the edges representing the landmarks’ visibility. The graph network receives the parameters’ initial values as inputs and predicts their updates to the optimal values. It internally uses an intermediate representation of inputs which we design inspired by the normal equation of the Levenberg-Marquardt method. It is trained using the sum of reprojection errors as a loss function. The experiments show that the proposed method outputs parameter estimates with slightly inferior accuracy in 1/60–1/10 of time compared with the conventional BA.
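For reference, the normal equation of the Levenberg-Marquardt method that the intermediate representation is said to be inspired by is (JᵀJ + λI)δ = −Jᵀr for residuals r and Jacobian J; the numpy sketch below performs one such damped Gauss-Newton step on a toy problem, independent of the paper's graph network.

```python
import numpy as np

def lm_step(J, r, lam=1e-3):
    """One damped (Levenberg-Marquardt) update: solve (J^T J + lam * I) delta = -J^T r."""
    JtJ = J.T @ J
    A = JtJ + lam * np.eye(JtJ.shape[0])
    return np.linalg.solve(A, -J.T @ r)

# toy bundle-adjustment-like problem: 40 reprojection residuals over 10 parameters
rng = np.random.default_rng(0)
J = rng.normal(size=(40, 10))      # Jacobian of residuals w.r.t. camera/landmark parameters
r = rng.normal(size=40)            # current reprojection residuals
params_update = lm_step(J, r)
print(params_update.shape)          # (10,)
```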
{"title":"Learning to Bundle-adjust: A Graph Network Approach to Faster Optimization of Bundle Adjustment for Vehicular SLAM","authors":"Tetsuya Tanaka, Socionext Inc, Yukihiro Sasagawa, Takayuki Okatani","doi":"10.1109/ICCV48922.2021.00619","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.00619","url":null,"abstract":"Bundle adjustment (BA) occupies a large portion of the execution time of SfM and visual SLAM. Local BA over the latest several keyframes plays a crucial role in visual SLAM. Its execution time should be sufficiently short for robust tracking; this is especially critical for embedded systems with a limited computational resource. This study proposes a learning-based bundle adjuster using a graph network. It works faster and can be used instead of conventional optimization-based BA. The graph network operates on a graph consisting of the nodes of keyframes and landmarks and the edges representing the landmarks’ visibility. The graph network receives the parameters’ initial values as inputs and predicts their updates to the optimal values. It internally uses an intermediate representation of inputs which we design inspired by the normal equation of the Levenberg-Marquardt method. It is trained using the sum of reprojection errors as a loss function. The experiments show that the proposed method outputs parameter estimates with slightly inferior accuracy in 1/60–1/10 of time compared with the conventional BA.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"1 1","pages":"6230-6239"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89568553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Semantic Aware Data Augmentation for Cell Nuclei Microscopical Images with Artificial Neural Networks
Pub Date : 2021-10-01 DOI: 10.1109/ICCV48922.2021.00392
Alireza Naghizadeh, Hongye Xu, Mohab Mohamed, Dimitris N. Metaxas, Dongfang Liu
There exist many powerful architectures for object detection and semantic segmentation of both biomedical and natural images. However, creating training datasets that are large and well-varied remains difficult. The importance of this subject stems from the amount of training data that artificial neural networks need to accurately identify and segment objects in images, and from the infeasibility of acquiring a sufficient dataset within the biomedical field. This paper introduces a new data augmentation method that generates artificial cell nuclei microscopical images along with their correct semantic segmentation labels. Data augmentation provides a step toward accessing higher generalization capabilities of artificial neural networks. An initial set of segmentation objects is used with Greedy AutoAugment to find the strongest-performing augmentation policies. The found policies and the initial set of segmentation objects are then used to create the final artificial images. When comparing state-of-the-art data augmentation methods with the proposed method, the proposed method consistently outperforms current solutions in the generation of nuclei microscopical images.
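The greedy policy search can be sketched as iteratively extending the current best augmentation policy one operation at a time, scored by a user-supplied validation metric; the candidate operations and scoring callback below are assumptions, not the actual Greedy AutoAugment search space.

```python
import random

def greedy_augment_search(candidate_ops, evaluate, policy_len=3, trials_per_step=5):
    """Greedily grow an augmentation policy: at each step, keep the single operation
    (with a random magnitude) that most improves the validation score."""
    policy, best_score = [], evaluate([])
    for _ in range(policy_len):
        best_op = None
        for op in candidate_ops:
            for _ in range(trials_per_step):
                magnitude = random.uniform(0.0, 1.0)
                score = evaluate(policy + [(op, magnitude)])
                if score > best_score:
                    best_score, best_op = score, (op, magnitude)
        if best_op is None:          # no candidate improves the score any further
            break
        policy.append(best_op)
    return policy, best_score

# toy usage: in practice the evaluation callback would train/validate a segmentation model
ops = ["rotate", "shear", "elastic_deform", "brightness"]
policy, score = greedy_augment_search(ops, evaluate=lambda p: random.random())
print(policy)
```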
{"title":"Semantic Aware Data Augmentation for Cell Nuclei Microscopical Images with Artificial Neural Networks","authors":"Alireza Naghizadeh, Hongye Xu, Mohab Mohamed, Dimitris N. Metaxas, Dongfang Liu","doi":"10.1109/ICCV48922.2021.00392","DOIUrl":"https://doi.org/10.1109/ICCV48922.2021.00392","url":null,"abstract":"There exists many powerful architectures for object detection and semantic segmentation of both biomedical and natural images. However, a difficulty arises in the ability to create training datasets that are large and well-varied. The importance of this subject is nested in the amount of training data that artificial neural networks need to accurately identify and segment objects in images and the infeasibility of acquiring a sufficient dataset within the biomedical field. This paper introduces a new data augmentation method that generates artificial cell nuclei microscopical images along with their correct semantic segmentation labels. Data augmentation provides a step toward accessing higher generalization capabilities of artificial neural networks. An initial set of segmentation objects is used with Greedy AutoAugment to find the strongest performing augmentation policies. The found policies and the initial set of segmentation objects are then used in the creation of the final artificial images. When comparing the state-of-the-art data augmentation methods with the proposed method, the proposed method is shown to consistently outperform current solutions in the generation of nuclei microscopical images.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"31 1","pages":"3932-3941"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87024820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3