
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Latest Publications

RankDetNet: Delving into Ranking Constraints for Object Detection
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00033
Ji Liu, Dong Li, R. Zheng, Luchao Tian, Yi Shan
Modern object detection approaches cast detecting objects as optimizing two subtasks of classification and localization simultaneously. Existing methods often learn the classification task by optimizing each proposal separately and neglect the relationship among different proposals. Such a detection paradigm also encounters a mismatch between classification and localization due to the inherent discrepancy of their optimization targets. In this work, we propose a ranking-based optimization algorithm for harmoniously learning to rank and localize proposals in lieu of the classification task. To this end, we comprehensively investigate three types of ranking constraints, i.e., global ranking, class-specific ranking and IoU-guided ranking losses. The global ranking loss encourages foreground samples to rank higher than background. The class-specific ranking loss ensures that positive samples rank higher than negative ones for each specific class. The IoU-guided ranking loss aims to align each pair of confidence scores with the associated pair of IoU overlaps between two positive samples of a specific class. Our ranking constraints can sufficiently explore the relationships between samples from three different perspectives. They are easy to implement, compatible with mainstream detection frameworks, and computation-free at inference. Experiments demonstrate that our RankDetNet consistently surpasses prior anchor-based and anchor-free baselines, e.g., improving the RetinaNet baseline by 2.5% AP on the COCO test-dev set without bells and whistles. We also apply the proposed ranking constraints to 3D object detection and achieve improved performance, which further validates the superiority and generality of our method.
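The three constraints above translate naturally into pairwise hinge-style losses. The PyTorch sketch below is a minimal illustration under that assumption; the margins, tensor shapes, and the exact pairwise formulation are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def global_ranking_loss(fg_scores, bg_scores, margin=0.2):
    # Every foreground proposal should score higher than every background proposal.
    # fg_scores: (F,), bg_scores: (B,)
    diff = bg_scores.unsqueeze(0) - fg_scores.unsqueeze(1) + margin  # (F, B) pairs
    return F.relu(diff).mean()

def class_specific_ranking_loss(pos_scores, neg_scores, margin=0.2):
    # For one class: positive proposals should score higher than negative ones.
    diff = neg_scores.unsqueeze(0) - pos_scores.unsqueeze(1) + margin
    return F.relu(diff).mean()

def iou_guided_ranking_loss(scores, ious):
    # For positive proposals of one class: the ordering of confidence scores
    # should agree with the ordering of their IoU overlaps.
    score_diff = scores.unsqueeze(1) - scores.unsqueeze(0)  # (N, N)
    iou_diff = ious.unsqueeze(1) - ious.unsqueeze(0)        # (N, N)
    return F.relu(-torch.sign(iou_diff) * score_diff).mean()
```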
Citations: 9
Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00225
Yawen Zeng, Da Cao, Xiaochi Wei, Meng Liu, Zhou Zhao, Zheng Qin
Given an untrimmed video and a query sentence, cross-modal video moment retrieval aims to rank a video moment from pre-segmented video moment candidates that best matches the query sentence. Pioneering work typically learns the representations of the textual and visual content separately and then obtains the interactions or alignments between different modalities. However, the task of cross-modal video moment retrieval is not yet thoroughly addressed as it needs to further identify the fine-grained differences of video moment candidates with high repeatability and similarity. Moreover, the relation among objects in both video and sentence is intuitive and efficient for understanding semantics but is rarely considered. Toward this end, we contribute a multi-modal relational graph to capture the interactions among objects from the visual and textual content to identify the differences among similar video moment candidates. Specifically, we first introduce a visual relational graph and a textual relational graph to form relation-aware representations via message propagation. Thereafter, a multi-task pre-training is designed to capture domain-specific knowledge about objects and relations, enhancing the structured visual representation after explicitly defining relations. Finally, graph matching and boundary regression are employed to perform the cross-modal retrieval. We conduct extensive experiments on two datasets about daily activities and cooking activities, demonstrating significant improvements over state-of-the-art solutions.
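As a rough illustration of relation-aware message propagation over such a graph, here is a minimal single-layer module in PyTorch; the dense adjacency matrix and the linear message/update functions are assumptions for illustration, not the paper's graph design.

```python
import torch
import torch.nn as nn

class RelationalGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(dim, dim)       # transform neighbor features into messages
        self.update = nn.Linear(2 * dim, dim)    # fuse a node with its aggregated messages

    def forward(self, nodes, adj):
        # nodes: (N, dim) object features; adj: (N, N) relation weights between objects
        msgs = adj @ self.message(nodes)                      # aggregate messages from related objects
        out = self.update(torch.cat([nodes, msgs], dim=-1))   # relation-aware node representations
        return torch.relu(out)
```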
Citations: 40
Progressive Modality Reinforcement for Human Multimodal Emotion Recognition from Unaligned Multimodal Sequences
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00258
Fengmao Lv, Xiang Chen, Yanyong Huang, Lixin Duan, Guosheng Lin
Human multimodal emotion recognition involves time-series data of different modalities, such as natural language, visual motions, and acoustic behaviors. Due to the variable sampling rates for sequences from different modalities, the collected multimodal streams are usually unaligned. The asynchrony across modalities increases the difficulty of conducting efficient multimodal fusion. Hence, this work mainly focuses on multimodal fusion from unaligned multimodal sequences. To this end, we propose the Progressive Modality Reinforcement (PMR) approach based on recent advances in crossmodal transformers. Our approach introduces a message hub to exchange information with each modality. The message hub sends common messages to each modality and reinforces their features via crossmodal attention. In turn, it also collects the reinforced features from each modality and uses them to generate a reinforced common message. By repeating the cycle process, the common message and the modalities’ features can progressively complement each other. Finally, the reinforced features are used to make predictions for human emotion. Comprehensive experiments on different human multimodal emotion recognition benchmarks clearly demonstrate the superiority of our approach.
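A minimal sketch of one reinforcement cycle follows, assuming standard multi-head attention for both directions of exchange; the hub-token layout, shared attention modules, and dimensions are illustrative choices rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class MessageHubCycle(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.hub_from_modality = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.modality_from_hub = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hub, modalities):
        # hub: (B, M, dim) common message tokens
        # modalities: list of (B, T_i, dim) unaligned sequences, one per modality
        reinforced = []
        for x in modalities:
            # each modality is reinforced by attending to the common message
            x, _ = self.modality_from_hub(x, hub, hub)
            reinforced.append(x)
        # the hub collects the reinforced features to form a reinforced common message
        collected = torch.cat(reinforced, dim=1)
        hub, _ = self.hub_from_modality(hub, collected, collected)
        return hub, reinforced
```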
Citations: 50
OCONet: Image Extrapolation by Object Completion
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00234
Richard Strong Bowen, Huiwen Chang, Charles Herrmann, Piotr Teterwak, Ce Liu, R. Zabih
Image extrapolation extends an input image beyond the originally-captured field of view. Existing methods struggle to extrapolate images with salient objects in the foreground or are limited to very specific objects such as humans, but tend to work well on indoor/outdoor scenes. We introduce OCONet (Object COmpletion Networks) to extrapolate foreground objects, with an object completion network conditioned on its class. OCONet uses an encoder-decoder architecture trained with adversarial loss to predict the object’s texture as well as its extent, represented as a predicted signed-distance field. An independent step extends the background, and the object is composited on top using the predicted mask. Both qualitative and quantitative results show that we improve on state-of-the-art image extrapolation results for challenging examples.
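The final compositing step can be sketched in a few lines; converting the predicted signed-distance field to a soft mask with a scaled sigmoid is an assumption for illustration, not necessarily how the paper binarizes its prediction.

```python
import torch

def composite(object_rgb, background_rgb, sdf, sharpness=50.0):
    # object_rgb, background_rgb: (B, 3, H, W); sdf: (B, 1, H, W),
    # positive inside the predicted object extent, negative outside.
    mask = torch.sigmoid(sharpness * sdf)  # soft object mask derived from the signed-distance field
    return mask * object_rgb + (1.0 - mask) * background_rgb
```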
Citations: 11
Transitional Adaptation of Pretrained Models for Visual Storytelling
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01247
Youngjae Yu, Jiwan Chung, Heeseung Yun, Jongseok Kim, Gunhee Kim
Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in the respective domains and jointly finetune them with the target task. However, this direct transfer practice may suffer from the discord between visual specificity and language fluency since they are often separately trained from large corpora of visual and text data with no common ground. In this work, we claim that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling. We propose a novel approach named Transitional Adaptation of Pre-trained Model (TAPM) that adapts the multi-modal modules to each other with a simpler alignment task between visual inputs only, without the need for text labels. Through extensive experiments, we show that the adaptation step significantly improves the performance of multiple language models for sequential video and image captioning tasks. We achieve new state-of-the-art performance on both language metrics and human evaluation in the multi-sentence description task of LSMDC 2019 [50] and the image storytelling task of VIST [18]. Our experiments reveal that this improvement in caption quality does not depend on the specific choice of language models.
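The abstract does not spell out the adaptation objective, so the skeleton below only shows where such a transitional phase sits between pretraining and task finetuning; adaptation_loss and task_loss are hypothetical placeholders, and nothing here is the authors' training code.

```python
import torch

def train_with_transitional_adaptation(visual_encoder, language_model, head,
                                       adaptation_loss, task_loss,
                                       adapt_loader, task_loader, lr=1e-4):
    params = (list(visual_encoder.parameters())
              + list(language_model.parameters())
              + list(head.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)

    # Transitional adaptation: harmonize the visual encoder with the language
    # model using visual inputs only (no text labels are consumed here).
    for visual_batch in adapt_loader:
        loss = adaptation_loss(visual_encoder, language_model, visual_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Finetuning on the downstream generation task (e.g., visual storytelling).
    for visual_batch, text_batch in task_loader:
        loss = task_loss(visual_encoder, language_model, head, visual_batch, text_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```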
Citations: 19
Quality-Agnostic Image Recognition via Invertible Decoder
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01208
Insoo Kim, S. Han, Ji-Won Baek, Seong-Jin Park, Jae-Joon Han, Jinwoo Shin
Despite the remarkable performance of deep models on image recognition tasks, they are known to be susceptible to common corruptions such as blur, noise, and low-resolution. Data augmentation is a conventional way to build a robust model by considering these common corruptions during the training. However, a naive data augmentation scheme may result in a non-specialized model for particular corruptions, as the model tends to learn the averaged distribution among corruptions. To mitigate the issue, we propose a new paradigm of training deep image recognition networks that produce clean-like features from any quality image via an invertible neural architecture. The proposed method consists of two stages. In the first stage, we train an invertible network with only clean images under the recognition objective. In the second stage, its inversion, i.e., the invertible decoder, is attached to a new recognition network and we train this encoder-decoder network using both clean and corrupted images by considering recognition and reconstruction objectives. Our two-stage scheme allows the network to produce clean-like and robust features from any quality images, by reconstructing their clean images via the invertible decoder. We demonstrate the effectiveness of our method on image classification and face recognition tasks.
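A minimal sketch of the second-stage objective, assuming a cross-entropy recognition loss and an L1 reconstruction loss through the invertible decoder attached to the new encoder; the loss choices and weighting are assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def stage2_loss(encoder, invertible_decoder, classifier, corrupted, clean, labels, w_rec=1.0):
    feats = encoder(corrupted)                   # clean-like features from an any-quality input
    logits = classifier(feats)
    recon = invertible_decoder(feats)            # invert the features back to a clean image
    loss_cls = F.cross_entropy(logits, labels)   # recognition objective
    loss_rec = F.l1_loss(recon, clean)           # reconstruction objective
    return loss_cls + w_rec * loss_rec
```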
Citations: 14
Unsupervised Learning of Depth and Depth-of-Field Effect from Natural Images with Aperture Rendering Generative Adversarial Networks
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01542
Takuhiro Kaneko
Understanding the 3D world from 2D projected natural images is a fundamental challenge in computer vision and graphics. Recently, an unsupervised learning approach has garnered considerable attention owing to its advantages in data collection. However, to mitigate training limitations, typical methods need to impose assumptions for viewpoint distribution (e.g., a dataset containing various viewpoint images) or object shape (e.g., symmetric objects). These assumptions often restrict applications; for instance, the application to non-rigid objects or images captured from similar viewpoints (e.g., flower or bird images) remains a challenge. To complement these approaches, we propose aperture rendering generative adversarial networks (AR-GANs), which equip aperture rendering on top of GANs, and adopt focus cues to learn the depth and depth-of-field (DoF) effect of unlabeled natural images. To address the ambiguities triggered by the unsupervised setting (i.e., ambiguities between smooth texture and out-of-focus blurs, and between foreground and background blurs), we develop DoF mixture learning, which enables the generator to learn the real image distribution while generating diverse DoF images. In addition, we devise a center focus prior to guide the learning direction. In the experiments, we demonstrate the effectiveness of AR-GANs on various datasets, such as flower, bird, and face images, demonstrate their portability by incorporating them into other 3D representation learning GANs, and validate their applicability in shallow DoF rendering.
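To make the aperture-rendering idea concrete, here is a heavily simplified layered renderer: pixels are blurred more the farther their depth lies from the focal plane, which is what ties depth to the DoF effect. The layered Gaussian approximation, layer count, and kernel size are illustrative assumptions, not the paper's differentiable renderer.

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def render_dof(image, depth, focal_depth=0.5, n_layers=8, max_sigma=3.0):
    # image: (B, 3, H, W); depth: (B, 1, H, W); both scaled to [0, 1].
    out = torch.zeros_like(image)
    weight = torch.zeros_like(depth)
    edges = torch.linspace(0.0, 1.0, n_layers + 1)
    for i in range(n_layers):
        low, high = edges[i].item(), edges[i + 1].item()
        if i == n_layers - 1:
            high += 1e-6  # include depth == 1.0 in the last layer
        mask = ((depth >= low) & (depth < high)).float()
        # Blur radius grows with distance from the focal plane.
        sigma = max(max_sigma * abs(0.5 * (low + high) - focal_depth), 1e-3)
        out = out + gaussian_blur(image, kernel_size=11, sigma=sigma) * mask
        weight = weight + mask
    return out / weight.clamp(min=1e-6)
```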
Citations: 3
Information Bottleneck Disentanglement for Identity Swapping
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00341
Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, R. He
Improving the performance of face forgery detectors often requires more identity-swapped images of higher quality. One core objective of identity swapping is to generate identity-discriminative faces that are distinct from the target while identical to the source. To this end, properly disentangling identity and identity-irrelevant information is critical and remains a challenging endeavor. In this work, we propose a novel information disentangling and swapping network, called InfoSwap, to extract the most expressive information for identity representation from a pre-trained face recognition model. The key insight of our method is to formulate the learning of disentangled representations as optimizing an information bottleneck tradeoff, in terms of finding an optimal compression of the pretrained latent features. Moreover, a novel identity contrastive loss is proposed for further disentanglement by requiring a proper distance between the generated identity and the target. While most prior works have focused on using various loss functions to implicitly guide the learning of representations, we demonstrate that our model can provide explicit supervision for learning disentangled representations, achieving impressive performance in generating more identity-discriminative swapped faces.
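One way to read the identity contrastive loss is as a pull toward the source identity combined with a margin away from the target identity. The cosine-similarity formulation below is an illustrative assumption in that spirit, not the paper's exact loss.

```python
import torch.nn.functional as F

def identity_contrastive_loss(id_swapped, id_source, id_target, margin=0.3):
    # Inputs: (B, D) identity embeddings from a pretrained face recognition model.
    sim_source = F.cosine_similarity(id_swapped, id_source)  # should be high: same identity as source
    sim_target = F.cosine_similarity(id_swapped, id_target)  # should stay below a margin
    return (1.0 - sim_source).mean() + F.relu(sim_target - margin).mean()
```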
Citations: 53
RPN Prototype Alignment For Domain Adaptive Object Detector
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01224
Y. Zhang, Zilei Wang, Yushi Mao
Recent years have witnessed great progress in object detection. However, due to the domain shift problem, applying the knowledge of an object detector learned from one specific domain to another often suffers from severe performance degradation. Most existing methods adopt feature alignment either on the backbone network or the instance classifier to increase the transferability of the object detector. Differently, we propose to perform feature alignment in the RPN stage such that the foreground and background RPN proposals in the target domain can be effectively distinguished. Specifically, we first construct a set of learnable RPN prototypes, and then enforce the RPN features to align with the prototypes for both the source and target domains. This essentially couples the learning of the RPN prototypes and features to align the source and target RPN features. In particular, we propose a simple yet effective method suitable for RPN feature alignment to generate high-quality pseudo labels for proposals in the target domain, i.e., by filtering the detection results with IoU. Furthermore, we adopt Grad-CAM to find the discriminative region within a foreground proposal and use it to increase the discriminability of RPN features for alignment. We conduct extensive experiments on multiple cross-domain detection scenarios, and the results show the effectiveness of our proposed method against previous state-of-the-art methods.
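A minimal sketch of aligning RPN proposal features to a small set of learnable prototypes follows; the cosine-distance pull toward the assigned prototype and the two-prototype (foreground/background) setup are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNPrototypeAlignment(nn.Module):
    def __init__(self, dim, num_prototypes=2):
        super().__init__()
        # e.g., one foreground and one background prototype
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, rpn_feats, labels):
        # rpn_feats: (N, dim) proposal features; labels: (N,) prototype indices,
        # ground truth in the source domain, pseudo labels in the target domain.
        protos = F.normalize(self.prototypes, dim=-1)
        feats = F.normalize(rpn_feats, dim=-1)
        # Pull each proposal feature toward its assigned prototype (cosine distance).
        return (1.0 - (feats * protos[labels]).sum(dim=-1)).mean()
```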
Citations: 58
DeepACG: Co-Saliency Detection via Semantic-aware Contrast Gromov-Wasserstein Distance
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01349
Kaihua Zhang, Mengming Michael Dong, Bo Liu, Xiaotong Yuan, Qingshan Liu
The objective of co-saliency detection is to segment the co-occurring salient objects in a group of images. To address this task, we introduce a new deep network architecture via a semantic-aware contrast Gromov-Wasserstein distance (DeepACG). We first adopt the Gromov-Wasserstein (GW) distance to build dense 4D correlation volumes for all pairs of image pixels within the image group. These dense correlation volumes enable the network to accurately discover the structured pair-wise pixel similarities among the common salient objects. Second, we develop a semantic-aware co-attention module (SCAM) to enhance the foreground co-saliency through predicted categorical information. Specifically, SCAM recognizes the semantic class of the foreground co-objects, and this information is then modulated into the deep representations to localize the related pixels. Third, we design a contrast edge-enhanced module (EEM) to capture richer contexts and preserve fine-grained spatial information. We validate the effectiveness of our model on the three largest and most challenging benchmark datasets (Cosal2015, CoCA, and CoSOD3k). Extensive experiments have demonstrated the substantial practical merit of each module. Compared with existing works, DeepACG shows significant improvements and achieves state-of-the-art performance.
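For intuition about what a Gromov-Wasserstein coupling between the pixels of two images looks like, the sketch below computes it with the POT library on flattened pixel features; this illustrates the underlying distance only and is not the paper's in-network, differentiable construction of the 4D correlation volumes.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def gw_pixel_coupling(feat_a, feat_b):
    # feat_a: (Na, D), feat_b: (Nb, D) flattened per-pixel features of two images.
    C1 = ot.dist(feat_a, feat_a)            # intra-image structure of image A
    C2 = ot.dist(feat_b, feat_b)            # intra-image structure of image B
    p, q = ot.unif(feat_a.shape[0]), ot.unif(feat_b.shape[0])
    # T[i, k]: how strongly pixel i of image A corresponds to pixel k of image B.
    T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
    return T

# Example usage with random features:
# T = gw_pixel_coupling(np.random.rand(64, 32), np.random.rand(64, 32))
```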
Citations: 24