{"title":"探索高阶注意和分心的视觉语义匹配","authors":"Yongzhi Li, Duo Zhang, Yadong Mu","doi":"10.1109/cvpr42600.2020.01280","DOIUrl":null,"url":null,"abstract":"Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously-ignored aspects: high-order semantic information (e.g., object-predicate-subject triplet, object-attribute pair) and visual distraction (i.e., despite the high relevance to textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both visual and textual modalities. Our technical contributions are two-folds: firstly, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from two scene graphs. A novel cross-graph attention mechanism is proposed to contextually reweigh graph elements and calculate the inter-graph similarity; Secondly, some top-ranked samples are indeed false matching due to the co-occurrence of both highly-relevant and distracting information. We devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"15 1","pages":"12783-12792"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":"{\"title\":\"Visual-Semantic Matching by Exploring High-Order Attention and Distraction\",\"authors\":\"Yongzhi Li, Duo Zhang, Yadong Mu\",\"doi\":\"10.1109/cvpr42600.2020.01280\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously-ignored aspects: high-order semantic information (e.g., object-predicate-subject triplet, object-attribute pair) and visual distraction (i.e., despite the high relevance to textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both visual and textual modalities. Our technical contributions are two-folds: firstly, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from two scene graphs. A novel cross-graph attention mechanism is proposed to contextually reweigh graph elements and calculate the inter-graph similarity; Secondly, some top-ranked samples are indeed false matching due to the co-occurrence of both highly-relevant and distracting information. We devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. 
Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction.\",\"PeriodicalId\":6715,\"journal\":{\"name\":\"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"volume\":\"15 1\",\"pages\":\"12783-12792\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/cvpr42600.2020.01280\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/cvpr42600.2020.01280","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Visual-Semantic Matching by Exploring High-Order Attention and Distraction
Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address the task from two previously ignored aspects: high-order semantic information (e.g., object-predicate-subject triplets, object-attribute pairs) and visual distraction (i.e., even when an image is highly relevant to a textual query, it may also contain prominent distracting objects or visual relations). Specifically, we build scene graphs for both the visual and the textual modality. Our technical contributions are twofold. First, we formulate visual-semantic matching as an attention-driven cross-modality scene-graph matching problem: graph convolutional networks (GCNs) extract high-order information from the two scene graphs, and a novel cross-graph attention mechanism contextually reweights graph elements and computes the inter-graph similarity. Second, some top-ranked samples are in fact false matches because highly relevant and distracting information co-occur; we therefore devise an information-theoretic measure that estimates semantic distraction and re-ranks the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both the high-order attention and the distraction re-ranking components.
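To make the first contribution more concrete, below is a minimal sketch (our illustration, not the authors' released code) of how GCN-encoded scene-graph nodes from the two modalities could be compared with a cross-graph attention score. The names `SimpleGCNLayer` and `cross_graph_similarity`, the single-layer GCN, and the softmax-weighted cosine aggregation are assumptions made for exposition only; the paper's actual graph encoders and attention mechanism are not reproduced here.

```python
# Hypothetical sketch (not the authors' code): cross-graph attention between
# visual and textual scene-graph node embeddings, using PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbour features through a
    row-normalized adjacency matrix, then apply a linear projection."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        # feats: (num_nodes, in_dim), adj: (num_nodes, num_nodes) with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return F.relu(self.proj((adj / deg) @ feats))


def cross_graph_similarity(visual_nodes, textual_nodes, temperature=1.0):
    """Attention-weighted similarity between two sets of node embeddings.

    Each textual node attends over all visual nodes; its score is the
    attention-weighted cosine similarity, and the graph-level score is the
    mean over textual nodes (one plausible reading of cross-graph attention).
    """
    v = F.normalize(visual_nodes, dim=-1)   # (Nv, d)
    t = F.normalize(textual_nodes, dim=-1)  # (Nt, d)
    sims = t @ v.t()                        # (Nt, Nv) cosine similarities
    attn = F.softmax(sims / temperature, dim=-1)
    per_node = (attn * sims).sum(dim=-1)    # attended similarity per textual node
    return per_node.mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    gcn = SimpleGCNLayer(in_dim=16, out_dim=16)
    vis_feats, vis_adj = torch.randn(5, 16), torch.eye(5)
    txt_feats, txt_adj = torch.randn(4, 16), torch.eye(4)
    score = cross_graph_similarity(gcn(vis_feats, vis_adj), gcn(txt_feats, txt_adj))
    print(f"inter-graph similarity: {score.item():.4f}")
```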
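The second contribution can be sketched in a similarly hedged way: treat the distribution of query relevance over an image's elements as a probability distribution and use its entropy as a distraction proxy to re-rank retrieval results. The function names `distraction_score` and `rerank`, the entropy formulation, and the `alpha` trade-off are hypothetical; the abstract only states that an information-theoretic measure is used, not its exact form.

```python
# Hypothetical sketch (not the authors' measure): an entropy-based proxy for
# visual distraction used to re-rank an initial retrieval list.
import math


def distraction_score(relevance_weights, eps=1e-12):
    """Entropy of the normalized relevance of image elements to the query.

    High entropy means relevance is spread over many elements, i.e. prominent
    content unrelated to the query likely coexists with the relevant content.
    """
    total = sum(relevance_weights) + eps
    probs = [w / total for w in relevance_weights]
    return -sum(p * math.log(p + eps) for p in probs)


def rerank(candidates, alpha=0.5):
    """Re-rank (image_id, match_score, relevance_weights) triples by the
    match score penalized with the distraction estimate (alpha is a
    hypothetical trade-off hyper-parameter)."""
    scored = [
        (img_id, score - alpha * distraction_score(weights))
        for img_id, score, weights in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        ("img_a", 0.82, [0.7, 0.2, 0.1]),         # relevance focused on the query
        ("img_b", 0.85, [0.3, 0.25, 0.25, 0.2]),  # relevant but visually distracting
    ]
    print(rerank(candidates))  # img_a overtakes img_b after the distraction penalty
```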