{"title":"结合全局相似度和局部相似度的跨媒体图像-文本检索","authors":"Zhixin Li, Feng Ling, Canlong Zhang","doi":"10.1109/DSAA.2019.00029","DOIUrl":null,"url":null,"abstract":"In this paper, we study the problem of image-text matching in order to make the image and text have better semantic matching. In the previous work, people just simply used the pre-training network to extract image and text features and project directly into a common subspace, or change various loss functions on this basis, or use the attention mechanism to directly match the image region proposals and the text phrases. This is not a good match for the semantics of the image and the text. In this study, we propose a method of cross-media retrieval based on global representation and local representation. We constructed a cross-media two-level network to explore better semantic matching between images and text, which contains subnets that handle both global and local features. Specifically, we not only use the self-attention network to obtain a macro representation of the global image but also use the local fine-grained patch with the attention mechanism. Then, we use a two-level alignment framework to promote each other to learn different representations of cross-media retrieval. The innovation of this study lies in the use of more comprehensive features of image and text to design the two kinds of similarity and add them up in some way. Experimental results show that this method is effective in image-text retrieval. Experimental results on the Flickr30K and MS-COCO datasets show that this model has a better recall rate than many of the current advanced cross-media retrieval models.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Cross-Media Image-Text Retrieval Combined with Global Similarity and Local Similarity\",\"authors\":\"Zhixin Li, Feng Ling, Canlong Zhang\",\"doi\":\"10.1109/DSAA.2019.00029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we study the problem of image-text matching in order to make the image and text have better semantic matching. In the previous work, people just simply used the pre-training network to extract image and text features and project directly into a common subspace, or change various loss functions on this basis, or use the attention mechanism to directly match the image region proposals and the text phrases. This is not a good match for the semantics of the image and the text. In this study, we propose a method of cross-media retrieval based on global representation and local representation. We constructed a cross-media two-level network to explore better semantic matching between images and text, which contains subnets that handle both global and local features. Specifically, we not only use the self-attention network to obtain a macro representation of the global image but also use the local fine-grained patch with the attention mechanism. Then, we use a two-level alignment framework to promote each other to learn different representations of cross-media retrieval. The innovation of this study lies in the use of more comprehensive features of image and text to design the two kinds of similarity and add them up in some way. 
Experimental results show that this method is effective in image-text retrieval. Experimental results on the Flickr30K and MS-COCO datasets show that this model has a better recall rate than many of the current advanced cross-media retrieval models.\",\"PeriodicalId\":416037,\"journal\":{\"name\":\"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSAA.2019.00029\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSAA.2019.00029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cross-Media Image-Text Retrieval Combined with Global Similarity and Local Similarity
In this paper, we study the problem of image-text matching with the goal of achieving better semantic alignment between images and text. Previous work typically used a pre-trained network to extract image and text features and projected them directly into a common subspace, varied the loss function on that basis, or used an attention mechanism to match image region proposals directly to text phrases. None of these approaches captures the semantics of the image and the text well. In this study, we propose a cross-media retrieval method based on both global and local representations. We construct a cross-media two-level network, containing subnetworks that handle global and local features, to learn better semantic matching between images and text. Specifically, we use a self-attention network to obtain a macro representation of the whole image, and we also apply an attention mechanism to fine-grained local patches. A two-level alignment framework then lets the two representations reinforce each other during training for cross-media retrieval. The novelty of this study lies in using more comprehensive image and text features to design two kinds of similarity, one global and one local, and combining them into a single matching score. Experimental results on the Flickr30K and MS-COCO datasets show that the method is effective for image-text retrieval and achieves better recall than many state-of-the-art cross-media retrieval models.
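To make the fusion idea concrete, below is a minimal PyTorch sketch of combining a global similarity (between pooled whole-image and whole-sentence embeddings) with a local similarity (word-to-region attention over fine-grained features). The function names, tensor shapes, and the weighting parameter alpha are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def global_similarity(img_global: torch.Tensor, txt_global: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between the pooled global image and text embeddings,
    # each of shape (d,).
    return F.cosine_similarity(img_global, txt_global, dim=-1)


def local_similarity(img_regions: torch.Tensor, txt_words: torch.Tensor) -> torch.Tensor:
    # img_regions: (R, d) region features; txt_words: (W, d) word features.
    # Each word attends over the image regions, and the matched scores are
    # averaged -- a simplified stand-in for the paper's local alignment.
    sims = F.normalize(txt_words, dim=-1) @ F.normalize(img_regions, dim=-1).t()  # (W, R)
    attn = F.softmax(sims, dim=-1)          # word-to-region attention weights
    attended = attn @ img_regions           # (W, d) attended image context per word
    word_scores = F.cosine_similarity(txt_words, attended, dim=-1)  # (W,)
    return word_scores.mean()


def combined_similarity(img_global, txt_global, img_regions, txt_words, alpha=0.5):
    # Weighted combination of global and local similarity; alpha is a
    # hypothetical fusion weight, not a value from the paper.
    return alpha * global_similarity(img_global, txt_global) \
        + (1 - alpha) * local_similarity(img_regions, txt_words)


# Toy usage with random features (d = 256, 36 regions, 12 words).
d = 256
score = combined_similarity(
    torch.randn(d), torch.randn(d),
    torch.randn(36, d), torch.randn(12, d),
)
```

In a retrieval setting, a score like this would be computed for every image-text pair and used to rank candidates; the relative weight of the two terms controls how much fine-grained region-word alignment influences the final ranking.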