Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection

Markos Diomataris, N. Gkanatsios, Vassilis Pitsikalis, P. Maragos
{"title":"Grounding Consistency: Distilling Spatial Common Sense for Precise Visual Relationship Detection","authors":"Markos Diomataris, N. Gkanatsios, Vassilis Pitsikalis, P. Maragos","doi":"10.1109/iccv48922.2021.01561","DOIUrl":null,"url":null,"abstract":"Scene Graph Generators (SGGs) are models that, given an image, build a directed graph where each edge represents a predicted subject predicate object triplet. Most SGGs silently exploit datasets' bias on relationships' context, i.e. its subject and object, to improve recall and neglect spatial and visual evidence, e.g. having seen a glut of data for person wearing shirt, they are overconfident that every person is wearing every shirt. Such imprecise predictions are mainly ascribed to the lack of negative examples for most relationships, which obstructs models from meaningfully learning predicates, even those that have ample positive examples. We first present an indepth investigation of the context bias issue to showcase that all examined state-of-the-art SGGs share the above vulnerabilities. In response, we propose a semi-supervised scheme that forces predicted triplets to be grounded consistently back to the image, in a closed-loop manner. The developed spatial common sense can be then distilled to a student SGG and substantially enhance its spatial reasoning ability. This Grounding Consistency Distillation (GCD) approach is model-agnostic and benefits from the superfluous unlabeled samples to retain the valuable context information and avert memorization of annotations. Furthermore, we demonstrate that current metrics disregard unlabeled samples, rendering themselves incapable of reflecting context bias, then we mine and incorporate during evaluation hard-negatives to reformulate precision as a reliable metric. Extensive experimental comparisons exhibit large quantitative - up to 70% relative precision boost on VG200 dataset - and qualitative improvements to prove the significance of our GCD method and our metrics towards refocusing graph generation as a core aspect of scene understanding. Code available at https://github.com/deeplab-ai/grounding-consistent-vrd.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"25 1","pages":"15891-15900"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/iccv48922.2021.01561","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Scene Graph Generators (SGGs) are models that, given an image, build a directed graph in which each edge represents a predicted ⟨subject, predicate, object⟩ triplet. Most SGGs silently exploit the datasets' bias on a relationship's context, i.e. its subject and object, to improve recall, while neglecting spatial and visual evidence; e.g., having seen a glut of data for "person wearing shirt", they are overconfident that every person is wearing every shirt. Such imprecise predictions are mainly ascribed to the lack of negative examples for most relationships, which prevents models from meaningfully learning predicates, even predicates with ample positive examples. We first present an in-depth investigation of the context-bias issue to show that all examined state-of-the-art SGGs share the above vulnerabilities. In response, we propose a semi-supervised scheme that forces predicted triplets to be grounded consistently back to the image, in a closed-loop manner. The developed spatial common sense can then be distilled to a student SGG and substantially enhance its spatial reasoning ability. This Grounding Consistency Distillation (GCD) approach is model-agnostic and benefits from the surplus unlabeled samples to retain valuable context information and avert memorization of annotations. Furthermore, we demonstrate that current metrics disregard unlabeled samples, rendering them incapable of reflecting context bias; we then mine hard negatives and incorporate them during evaluation to reformulate precision as a reliable metric. Extensive experimental comparisons exhibit large quantitative improvements (up to a 70% relative precision boost on the VG200 dataset) as well as qualitative ones, proving the significance of our GCD method and our metrics towards refocusing graph generation as a core aspect of scene understanding. Code is available at https://github.com/deeplab-ai/grounding-consistent-vrd.
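To make the closed-loop idea in the abstract concrete, the sketch below shows one plausible way to score a predicted ⟨subject, predicate, object⟩ triplet by grounding it back to the image and to use that score to weight a distillation term for the student SGG. This is a simplified illustration under our own assumptions, not the authors' implementation: the class and function names (GroundingTeacher, grounding_consistency_weight, distillation_loss), the IoU-based consistency score, and the weighting scheme are ours; the actual method is in the repository linked above.

```python
# Hypothetical sketch (not the authors' code): closed-loop grounding consistency
# for a predicted <subject, predicate, object> triplet, expressed in PyTorch.
import torch
import torch.nn as nn


def box_iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    """IoU between batched [x1, y1, x2, y2] boxes of shape [N, 4]."""
    x1 = torch.max(box_a[:, 0], box_b[:, 0])
    y1 = torch.max(box_a[:, 1], box_b[:, 1])
    x2 = torch.min(box_a[:, 2], box_b[:, 2])
    y2 = torch.min(box_a[:, 3], box_b[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[:, 2] - box_a[:, 0]) * (box_a[:, 3] - box_a[:, 1])
    area_b = (box_b[:, 2] - box_b[:, 0]) * (box_b[:, 3] - box_b[:, 1])
    return inter / (area_a + area_b - inter + 1e-6)


class GroundingTeacher(nn.Module):
    """Illustrative grounder: given image features, the subject box and a
    predicate, regress where the object box should be located."""

    def __init__(self, feat_dim: int, num_predicates: int):
        super().__init__()
        self.predicate_emb = nn.Embedding(num_predicates, 64)
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim + 4 + 64, 256), nn.ReLU(), nn.Linear(256, 4)
        )

    def forward(self, img_feat, subj_box, predicate_id):
        x = torch.cat([img_feat, subj_box, self.predicate_emb(predicate_id)], dim=-1)
        return self.regressor(x)  # predicted object box [x1, y1, x2, y2]


def grounding_consistency_weight(teacher, img_feat, subj_box, obj_box, predicate_id):
    """Closed-loop check: a predicate is spatially plausible only if the teacher,
    given the subject and that predicate, can re-localize the actual object."""
    with torch.no_grad():
        regressed_obj = teacher(img_feat, subj_box, predicate_id)
    return box_iou(regressed_obj, obj_box)  # in [0, 1]; low IoU => implausible triplet


def distillation_loss(student_logits, predicate_id, consistency):
    """Weight the student's loss on unlabeled subject-object pairs (predicate_id
    acts as a pseudo-label) by how consistently that predicate grounds back."""
    ce = nn.functional.cross_entropy(student_logits, predicate_id, reduction="none")
    return (consistency * ce).mean()
```

In a full semi-supervised setup such as the one the abstract describes, a term of this kind would presumably supplement the usual supervised loss on annotated triplets; the sketch is only meant to convey the closed-loop structure of grounding a prediction back to the image before trusting it.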