Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention

Xin Hu, Lingling Zhang, Jun Liu, Xinyu Zhang, Wenjun Wu, Qianying Wang
{"title":"Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention","authors":"Xin Hu, Lingling Zhang, Jun Liu, Xinyu Zhang, Wenjun Wu, Qianying Wang","doi":"10.24963/ijcai.2023/93","DOIUrl":null,"url":null,"abstract":"Diagram visual grounding aims to capture the correlation between language expression and local objects in the diagram, and plays an important role in the applications like textbook question answering and cross-modal retrieval. Most diagrams consist of several colors and simple geometries. This results in sparse low-level visual features, which further aggravates the gap between low-level visual and high-level semantic features of diagrams. The phenomenon brings challenges to the diagram visual grounding. To solve the above issues, we propose a gestalt-perceptual attention model to align the diagram objects and language expressions. For low-level visual features, inspired by the gestalt that simulates human visual system, we build a gestalt-perception graph network to make up the features learned by the traditional backbone network. For high-level semantic features, we design a multi-modal context attention mechanism to facilitate the interaction between diagrams and language expressions, so as to enhance the semantics of diagrams. Finally, guided by diagram features and linguistic embedding, the target query is gradually decoded to generate the coordinates of the referred object. By conducting comprehensive experiments on diagrams and natural images, we demonstrate that the proposed model achieves superior performance over the competitors. Our code will be released at https://github.com/AIProCode/GPA.","PeriodicalId":394530,"journal":{"name":"International Joint Conference on Artificial Intelligence","volume":"53 91 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Joint Conference on Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.24963/ijcai.2023/93","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Diagram visual grounding aims to capture the correlation between a language expression and local objects in a diagram, and plays an important role in applications such as textbook question answering and cross-modal retrieval. Most diagrams contain only a few colors and simple geometric shapes. This yields sparse low-level visual features, which further widens the gap between the low-level visual and high-level semantic features of diagrams and makes diagram visual grounding challenging. To address these issues, we propose a gestalt-perceptual attention model that aligns diagram objects with language expressions. For low-level visual features, inspired by Gestalt principles that model how the human visual system groups elements, we build a gestalt-perception graph network to complement the features learned by a conventional backbone network. For high-level semantic features, we design a multi-modal context attention mechanism that facilitates interaction between diagrams and language expressions, thereby enriching the semantics of the diagram features. Finally, guided by the diagram features and the linguistic embedding, a target query is progressively decoded into the coordinates of the referred object. Comprehensive experiments on both diagrams and natural images demonstrate that the proposed model outperforms competing methods. Our code will be released at https://github.com/AIProCode/GPA.
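The abstract describes a three-stage architecture: a gestalt-perception graph branch that enriches the sparse backbone features, a multi-modal context attention step that fuses diagram and language features, and a query decoder that regresses the referred object's coordinates. The PyTorch sketch below is one plausible, minimal reading of that pipeline; every module, shape, and the particular Gestalt cues used (spatial proximity and feature similarity) are illustrative assumptions, not the authors' implementation (to be released at the repository above).

```python
# Minimal, self-contained sketch of the pipeline the abstract describes.
# All names and shapes are hypothetical stand-ins for the paper's modules.
import torch
import torch.nn as nn
import torch.nn.functional as F


def gestalt_adjacency(pos, feat, sigma=0.1):
    """Soft adjacency over diagram tokens from two Gestalt cues:
    proximity (nearby elements group together) and similarity
    (elements with similar appearance group together)."""
    dist = torch.cdist(pos, pos)                                  # (B, N, N) pairwise distances
    proximity = torch.exp(-dist ** 2 / (2 * sigma ** 2))
    similarity = F.cosine_similarity(feat.unsqueeze(2), feat.unsqueeze(1), dim=-1)
    adj = proximity * similarity.clamp(min=0)
    return adj / adj.sum(-1, keepdim=True).clamp(min=1e-6)        # row-normalize


class GPASketch(nn.Module):
    def __init__(self, d_model=256, num_queries=1):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in visual backbone
        self.gnn = nn.Linear(d_model, d_model)                    # one graph message-passing step
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
        self.query = nn.Parameter(torch.randn(num_queries, d_model))      # learnable target query
        self.box_head = nn.Linear(d_model, 4)                     # (cx, cy, w, h), normalized

    def forward(self, image, text_emb):
        B = image.size(0)
        # 1) Low-level tokens from the backbone: (B, N, d).
        vis = self.backbone(image).flatten(2).transpose(1, 2)
        # Token positions on the feature grid, normalized to [0, 1].
        H = W = image.size(-1) // 16
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pos = torch.stack([xs, ys], -1).flatten(0, 1).float().to(image.device)
        pos = (pos / max(H - 1, 1)).unsqueeze(0).expand(B, -1, -1)
        # 2) Gestalt-perception graph branch complements the sparse backbone features.
        adj = gestalt_adjacency(pos, vis)
        vis = vis + F.relu(self.gnn(torch.bmm(adj, vis)))
        # 3) Multi-modal context attention: diagram tokens attend to language tokens.
        vis = vis + self.cross_attn(vis, text_emb, text_emb)[0]
        # 4) The target query is decoded against the fused features
        #    and regressed to the referred object's box coordinates.
        q = self.decoder(self.query.unsqueeze(0).expand(B, -1, -1), vis)
        return self.box_head(q).sigmoid()
```

A toy invocation, under the same assumptions:

```python
model = GPASketch()
boxes = model(torch.randn(2, 3, 224, 224),  # two diagram images
              torch.randn(2, 12, 256))      # 12 language tokens from some text encoder
print(boxes.shape)  # torch.Size([2, 1, 4]) -- one (cx, cy, w, h) box per image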