Hierarchical Semantic Enhanced Directional Graph Network for Visual Commonsense Reasoning

Mingyan Wu, Shuhan Qi, Jun Rao, Jia-jia Zhang, Qing Liao, Xuan Wang, Xinxin Liao
{"title":"Hierarchical Semantic Enhanced Directional Graph Network for Visual Commonsense Reasoning","authors":"Mingyan Wu, Shuhan Qi, Jun Rao, Jia-jia Zhang, Qing Liao, Xuan Wang, Xinxin Liao","doi":"10.1145/3475731.3484957","DOIUrl":null,"url":null,"abstract":"Visual commonsense reasoning (VCR) task aims at boosting research of cognition-level correlations reasoning. It requires not only a thorough understanding of correlated details of the scene but also the ability to infer correlation with related commonsense knowledge. Existing approaches consider the region-word affinity to perform the semantic alignment between vision and linguistic domains, which neglect the implicit correspondence (e.g. word-scene, region-phrase, and phrase-scene) among visual concepts and linguistic words. Although efforts have been made to deliver promising results in previous work, these methods are still confronted with challenges when comes to make interpretable reasoning. Toward this end, we present a novel hierarchical semantic enhanced directional graph network. To be more specific, we design a Modality Interaction Unit (MIU) module, which captures high-order cross-modal alignment by aggregating the hierarchical vision-language relationships. Afterward, we propose a direction clue-aware graph reasoning (DCGR) module. In this module, valuable entities can be dynamically selected in each reasoning step, according to the importance of these entities. This leads to a more interpretable reasoning procedure. Ultimately, heterogeneous graph attention is introduced to filter the irrelevant parts of the final answers. Extensive experiments have been conducted on the VCR benchmark dataset, which demonstrates that our method can achieve competitive results and better interpretability compared with several state-of-the-art baselines.","PeriodicalId":355632,"journal":{"name":"Proceedings of the 1st International Workshop on Trustworthy AI for Multimedia Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st International Workshop on Trustworthy AI for Multimedia Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3475731.3484957","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The visual commonsense reasoning (VCR) task aims to advance research on cognition-level correlation reasoning. It requires not only a thorough understanding of the correlated details of a scene but also the ability to infer correlations using related commonsense knowledge. Existing approaches rely on region-word affinity to perform semantic alignment between the visual and linguistic domains, neglecting the implicit correspondences (e.g., word-scene, region-phrase, and phrase-scene) between visual concepts and linguistic words. Although previous work has delivered promising results, these methods still face challenges in producing interpretable reasoning. Toward this end, we present a novel hierarchical semantic enhanced directional graph network. More specifically, we design a Modality Interaction Unit (MIU) module, which captures high-order cross-modal alignment by aggregating hierarchical vision-language relationships. We then propose a direction clue-aware graph reasoning (DCGR) module, in which valuable entities are dynamically selected at each reasoning step according to their importance, leading to a more interpretable reasoning procedure. Finally, heterogeneous graph attention is introduced to filter out the irrelevant parts of the final answers. Extensive experiments on the VCR benchmark dataset demonstrate that our method achieves competitive results and better interpretability compared with several state-of-the-art baselines.
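The abstract names its components (MIU, DCGR, heterogeneous graph attention) without implementation detail; the full paper at the DOI above specifies them. Purely as an illustration of the kind of hierarchical vision-language alignment the MIU module is described as performing, the sketch below computes region-word and region-phrase affinities and aggregates both into language-enhanced region features. The class name, dimensions, and the sliding-window phrase pooling are assumptions made for this example, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): hierarchical cross-modal
# alignment in the spirit of the MIU module. Regions attend to both
# word-level and phrase-level text features, and the two views are fused.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAlignmentUnit(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, hid_dim=512, phrase_win=3):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.phrase_win = phrase_win
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, regions, words):
        # regions: (B, R, vis_dim) object-region features
        # words:   (B, T, txt_dim) token features of the question/answer
        v = self.vis_proj(regions)            # (B, R, H)
        w = self.txt_proj(words)              # (B, T, H)

        # Phrase-level features: average tokens over a sliding window.
        p = F.avg_pool1d(
            w.transpose(1, 2),
            kernel_size=self.phrase_win,
            stride=1,
            padding=self.phrase_win // 2,
        ).transpose(1, 2)                      # (B, T, H)

        scale = v.size(-1) ** 0.5
        # Region-word and region-phrase affinity matrices.
        a_word = torch.softmax(v @ w.transpose(1, 2) / scale, dim=-1)
        a_phrase = torch.softmax(v @ p.transpose(1, 2) / scale, dim=-1)

        # Each region aggregates word- and phrase-level context.
        word_ctx = a_word @ w                  # (B, R, H)
        phrase_ctx = a_phrase @ p              # (B, R, H)
        aligned = self.fuse(torch.cat([word_ctx, phrase_ctx], dim=-1))
        return v + aligned                     # language-enhanced region features


if __name__ == "__main__":
    unit = HierarchicalAlignmentUnit()
    regions = torch.randn(2, 10, 2048)
    words = torch.randn(2, 16, 768)
    print(unit(regions, words).shape)          # torch.Size([2, 10, 512])
```

The output could then feed a graph-reasoning stage such as the DCGR module, which the paper describes as selecting important entities step by step; that selection logic is not reproduced here.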