{"title":"结合全局相似度和局部相似度的跨媒体图像-文本检索","authors":"Zhixin Li, Feng Ling, Canlong Zhang","doi":"10.1109/DSAA.2019.00029","DOIUrl":null,"url":null,"abstract":"In this paper, we study the problem of image-text matching in order to make the image and text have better semantic matching. In the previous work, people just simply used the pre-training network to extract image and text features and project directly into a common subspace, or change various loss functions on this basis, or use the attention mechanism to directly match the image region proposals and the text phrases. This is not a good match for the semantics of the image and the text. In this study, we propose a method of cross-media retrieval based on global representation and local representation. We constructed a cross-media two-level network to explore better semantic matching between images and text, which contains subnets that handle both global and local features. Specifically, we not only use the self-attention network to obtain a macro representation of the global image but also use the local fine-grained patch with the attention mechanism. Then, we use a two-level alignment framework to promote each other to learn different representations of cross-media retrieval. The innovation of this study lies in the use of more comprehensive features of image and text to design the two kinds of similarity and add them up in some way. Experimental results show that this method is effective in image-text retrieval. Experimental results on the Flickr30K and MS-COCO datasets show that this model has a better recall rate than many of the current advanced cross-media retrieval models.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Cross-Media Image-Text Retrieval Combined with Global Similarity and Local Similarity\",\"authors\":\"Zhixin Li, Feng Ling, Canlong Zhang\",\"doi\":\"10.1109/DSAA.2019.00029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we study the problem of image-text matching in order to make the image and text have better semantic matching. In the previous work, people just simply used the pre-training network to extract image and text features and project directly into a common subspace, or change various loss functions on this basis, or use the attention mechanism to directly match the image region proposals and the text phrases. This is not a good match for the semantics of the image and the text. In this study, we propose a method of cross-media retrieval based on global representation and local representation. We constructed a cross-media two-level network to explore better semantic matching between images and text, which contains subnets that handle both global and local features. Specifically, we not only use the self-attention network to obtain a macro representation of the global image but also use the local fine-grained patch with the attention mechanism. Then, we use a two-level alignment framework to promote each other to learn different representations of cross-media retrieval. The innovation of this study lies in the use of more comprehensive features of image and text to design the two kinds of similarity and add them up in some way. 
Experimental results show that this method is effective in image-text retrieval. Experimental results on the Flickr30K and MS-COCO datasets show that this model has a better recall rate than many of the current advanced cross-media retrieval models.\",\"PeriodicalId\":416037,\"journal\":{\"name\":\"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSAA.2019.00029\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSAA.2019.00029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cross-Media Image-Text Retrieval Combined with Global Similarity and Local Similarity
In this paper, we study the problem of image-text matching with the goal of achieving better semantic alignment between images and text. Previous work typically used a pre-trained network to extract image and text features and projected them directly into a common subspace, varied the loss function on that basis, or used an attention mechanism to match image region proposals directly to text phrases. None of these approaches captures the semantics of the image and the text well. In this study, we propose a cross-media retrieval method based on both global and local representations. We construct a cross-media two-level network, containing subnetworks that handle global and local features, to learn better semantic matching between images and text. Specifically, we use a self-attention network to obtain a macro representation of the whole image, and we also apply an attention mechanism to fine-grained local patches. A two-level alignment framework then lets the two representations reinforce each other during training for cross-media retrieval. The novelty of this study lies in using more comprehensive image and text features to design two kinds of similarity, one global and one local, and combining them into a single matching score. Experimental results on the Flickr30K and MS-COCO datasets show that the method is effective for image-text retrieval and achieves better recall than many state-of-the-art cross-media retrieval models.
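To make the fusion idea concrete, below is a minimal PyTorch sketch of combining a global similarity (between pooled whole-image and whole-sentence embeddings) with a local similarity (word-to-region attention over fine-grained features). The function names, tensor shapes, and the weighting parameter alpha are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def global_similarity(img_global: torch.Tensor, txt_global: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between the pooled global image and text embeddings,
    # each of shape (d,).
    return F.cosine_similarity(img_global, txt_global, dim=-1)


def local_similarity(img_regions: torch.Tensor, txt_words: torch.Tensor) -> torch.Tensor:
    # img_regions: (R, d) region features; txt_words: (W, d) word features.
    # Each word attends over the image regions, and the matched scores are
    # averaged -- a simplified stand-in for the paper's local alignment.
    sims = F.normalize(txt_words, dim=-1) @ F.normalize(img_regions, dim=-1).t()  # (W, R)
    attn = F.softmax(sims, dim=-1)          # word-to-region attention weights
    attended = attn @ img_regions           # (W, d) attended image context per word
    word_scores = F.cosine_similarity(txt_words, attended, dim=-1)  # (W,)
    return word_scores.mean()


def combined_similarity(img_global, txt_global, img_regions, txt_words, alpha=0.5):
    # Weighted combination of global and local similarity; alpha is a
    # hypothetical fusion weight, not a value from the paper.
    return alpha * global_similarity(img_global, txt_global) \
        + (1 - alpha) * local_similarity(img_regions, txt_words)


# Toy usage with random features (d = 256, 36 regions, 12 words).
d = 256
score = combined_similarity(
    torch.randn(d), torch.randn(d),
    torch.randn(36, d), torch.randn(12, d),
)
```

In a retrieval setting, a score like this would be computed for every image-text pair and used to rank candidates; the relative weight of the two terms controls how much fine-grained region-word alignment influences the final ranking.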