Visual-Semantic Matching by Exploring High-Order Attention and Distraction

Yongzhi Li, Duo Zhang, Yadong Mu
{"title":"Visual-Semantic Matching by Exploring High-Order Attention and Distraction","authors":"Yongzhi Li, Duo Zhang, Yadong Mu","doi":"10.1109/cvpr42600.2020.01280","DOIUrl":null,"url":null,"abstract":"Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously-ignored aspects: high-order semantic information (e.g., object-predicate-subject triplet, object-attribute pair) and visual distraction (i.e., despite the high relevance to textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both visual and textual modalities. Our technical contributions are two-folds: firstly, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from two scene graphs. A novel cross-graph attention mechanism is proposed to contextually reweigh graph elements and calculate the inter-graph similarity; Secondly, some top-ranked samples are indeed false matching due to the co-occurrence of both highly-relevant and distracting information. We devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"15 1","pages":"12783-12792"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/cvpr42600.2020.01280","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 23

Abstract

Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously ignored aspects: high-order semantic information (e.g., object-predicate-subject triplets, object-attribute pairs) and visual distraction (i.e., despite high relevance to the textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both the visual and textual modalities. Our technical contributions are twofold. First, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem: graph convolutional networks (GCNs) are used to extract high-order information from the two scene graphs, and a novel cross-graph attention mechanism contextually reweights graph elements and computes the inter-graph similarity. Second, some top-ranked samples are in fact false matches because highly relevant and distracting information co-occur; we devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction modeling.
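For orientation only, the sketch below illustrates one plausible way the attention-driven scene graph matching step could be wired up: node features from two GCN-encoded scene graphs are compared via cross-graph attention, and the attended similarities are pooled into a single image-sentence matching score. This is a minimal, hedged sketch, not the authors' implementation; the layer design, the temperature value, and the mean pooling are all assumptions, and real scene graph construction and feature extraction are omitted.

```python
# Hypothetical sketch (not the authors' code): attention-driven matching between
# two GCN-encoded scene graphs, assuming node features have already been
# extracted (e.g., region features for the visual graph, word embeddings for
# the textual graph).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbor features via the adjacency matrix."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (num_nodes, dim), adj: (num_nodes, num_nodes) row-normalized adjacency
        return F.relu(self.proj(adj @ x))


def cross_graph_similarity(img_nodes, txt_nodes):
    """Cross-graph attention: each textual node attends over visual nodes,
    and the attended similarities are pooled into one matching score."""
    img_nodes = F.normalize(img_nodes, dim=-1)    # (Nv, d)
    txt_nodes = F.normalize(txt_nodes, dim=-1)    # (Nt, d)
    sim = txt_nodes @ img_nodes.t()               # (Nt, Nv) pairwise cosine similarity
    attn = torch.softmax(sim / 0.1, dim=-1)       # attention of each text node over image nodes
    attended = attn @ img_nodes                   # (Nt, d) attended visual context per text node
    node_scores = (txt_nodes * attended).sum(-1)  # per-text-node relevance
    return node_scores.mean()                     # scalar image-sentence matching score


if __name__ == "__main__":
    d = 256
    visual_gcn, textual_gcn = SimpleGCNLayer(d), SimpleGCNLayer(d)
    v_feat, v_adj = torch.randn(8, d), torch.softmax(torch.randn(8, 8), dim=-1)
    t_feat, t_adj = torch.randn(5, d), torch.softmax(torch.randn(5, 5), dim=-1)
    score = cross_graph_similarity(visual_gcn(v_feat, v_adj), textual_gcn(t_feat, t_adj))
    print(f"matching score: {score.item():.4f}")
```

The distraction-aware re-ranking stage described in the abstract is not sketched here, since the paper's specific information-theoretic measure is not given in this excerpt.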