Scene graph generation via multi-relation classification and cross-modal attention coordinator

Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, Yang Yang

Proceedings of the 2nd ACM International Conference on Multimedia in Asia, 2021. DOI: https://doi.org/10.1145/3444685.3446276
Abstract
Scene graph generation aims to build a graph-based representation of an image, where nodes and edges represent objects and the relationships between them, respectively. However, scene graph generation today is heavily limited by imbalanced class prediction: most existing work achieves satisfying performance on simple, frequent relation classes (e.g. "on") yet performs poorly on fine-grained, infrequent ones (e.g. "walk on", "stand on"). To tackle this problem, we redesign the framework as two branches, a representation learning branch and a classifier learning branch, yielding a more balanced scene graph generator. For the representation learning branch, we propose a Cross-modal Attention Coordinator (CAC) that gathers consistent features from multiple modalities using dynamic attention. For the classifier learning branch, we first transfer relation-class knowledge from a large-scale corpus, then leverage a Multi-Relationship classifier built on Graph Attention neTworks (MR-GAT) to bridge the gap between frequent and infrequent relations. Comprehensive experimental results on VG200, a challenging dataset, indicate the competitiveness and significant superiority of our proposed approach.
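The abstract does not include implementation details, but the core CAC idea, fusing features from several modalities with dynamically computed (input-dependent) attention weights, can be sketched roughly as below. This is a minimal illustration in PyTorch; the module name, dimensions, and the scoring scheme are assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionCoordinator(nn.Module):
    """Hypothetical sketch of a CAC-style module: projects features from
    several modalities into a shared space, then fuses them with dynamic
    attention weights computed from the inputs themselves."""

    def __init__(self, dims, hidden_dim=512):
        super().__init__()
        # One projection per modality (e.g. visual, spatial, linguistic).
        self.projections = nn.ModuleList(
            nn.Linear(d, hidden_dim) for d in dims
        )
        # Scores each projected modality feature for dynamic weighting.
        self.attn_scorer = nn.Linear(hidden_dim, 1)

    def forward(self, features):
        # features: list of (batch, dim_i) tensors, one per modality.
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projections, features)],
            dim=1,
        )  # (batch, num_modalities, hidden_dim)
        # Dynamic attention: weights depend on the current inputs.
        scores = self.attn_scorer(torch.tanh(projected))  # (batch, M, 1)
        weights = F.softmax(scores, dim=1)
        # Weighted sum yields one consistent cross-modal feature.
        return (weights * projected).sum(dim=1)  # (batch, hidden_dim)

# Example: fuse hypothetical 2048-d visual and 300-d word features.
cac = CrossModalAttentionCoordinator(dims=[2048, 300])
fused = cac([torch.randn(8, 2048), torch.randn(8, 300)])  # (8, 512)
```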
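Similarly, MR-GAT is only named in the abstract. A generic graph-attention layer over relation-class nodes, which could propagate corpus-derived knowledge from frequent classes to infrequent ones, might look like the following. Every detail here (single-head attention, the adjacency source, the assumption that class embeddings come from a text corpus) is an assumption, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGATLayer(nn.Module):
    """Single-head graph attention layer (in the style of Velickovic et
    al., 2018) over relation-class nodes; a stand-in for MR-GAT, whose
    exact design the abstract does not specify."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h, adj):
        # h:   (num_classes, in_dim) relation-class embeddings, e.g.
        #      initialized from a large-scale text corpus.
        # adj: (num_classes, num_classes) 0/1 mask; assumed to include
        #      self-loops so every row has at least one neighbor.
        z = self.W(h)                                    # (N, out_dim)
        n = z.size(0)
        # Pairwise attention logits e_ij = a([z_i || z_j]).
        zi = z.unsqueeze(1).expand(n, n, -1)
        zj = z.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([zi, zj], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = F.softmax(e, dim=1)      # normalize over each node's neighbors
        return F.elu(alpha @ z)          # updated class embeddings (N, out_dim)
```

The intuition for using such a layer as a classifier is that infrequent relation classes ("walk on", "stand on") can borrow statistical strength from semantically related frequent ones ("on") through attention-weighted message passing over the class graph.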