Self-Supervised Visual Representations for Cross-Modal Retrieval

Yash J. Patel, Lluís Gómez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar
{"title":"跨模态检索的自监督视觉表示","authors":"Yash J. Patel, Lluís Gómez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar","doi":"10.1145/3323873.3325035","DOIUrl":null,"url":null,"abstract":"Cross-modal retrieval methods have been significantly improved in last years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists in training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration, and (2) the semantic context of its caption. Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like classification, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Self-Supervised Visual Representations for Cross-Modal Retrieval\",\"authors\":\"Yash J. Patel, Lluís Gómez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar\",\"doi\":\"10.1145/3323873.3325035\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-modal retrieval methods have been significantly improved in last years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists in training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration, and (2) the semantic context of its caption. 
Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like classification, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.\",\"PeriodicalId\":149041,\"journal\":{\"name\":\"Proceedings of the 2019 on International Conference on Multimedia Retrieval\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-01-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 on International Conference on Multimedia Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3323873.3325035\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3323873.3325035","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9

Abstract

Cross-modal retrieval methods have been significantly improved in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are limited to discrete sets of popular visual classes that may not be representative of the richer semantics found in large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists in training a CNN to predict: (1) the semantic context of the article in which an image is more likely to appear as an illustration, and (2) the semantic context of its caption. Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like classification, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.
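The abstract only outlines the training signal, so the following minimal Python sketch illustrates one way such a two-headed prediction setup could look. It is not the authors' implementation: the topic-probability targets (e.g. produced by a topic model such as LDA fitted on Wikipedia text), the number of topics, the ResNet-50 backbone, the soft cross-entropy loss, and the loss weight alpha are all assumptions made for illustration.

    # Hedged sketch, not the authors' released code. Assumes the "semantic context"
    # of an article and of a caption are each given as fixed-length topic-probability
    # vectors, and that the CNN regresses both distributions from the image alone.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class TwoHeadContextNet(nn.Module):
        """CNN backbone with an article-context head and a caption-context head."""
        def __init__(self, num_topics=40):          # num_topics is an assumption
            super().__init__()
            backbone = models.resnet50(weights=None)  # any ImageNet-style CNN could be used
            feat_dim = backbone.fc.in_features
            backbone.fc = nn.Identity()               # keep the pooled visual features
            self.backbone = backbone
            self.article_head = nn.Linear(feat_dim, num_topics)
            self.caption_head = nn.Linear(feat_dim, num_topics)

        def forward(self, images):
            feats = self.backbone(images)
            return self.article_head(feats), self.caption_head(feats)

    def soft_cross_entropy(logits, target_dist):
        """Cross-entropy against a soft (probability-vector) target."""
        log_probs = torch.log_softmax(logits, dim=1)
        return -(target_dist * log_probs).sum(dim=1).mean()

    def training_step(model, images, article_topics, caption_topics, alpha=0.5):
        """One self-supervised step: predict both semantic contexts from the image."""
        article_logits, caption_logits = model(images)
        return (alpha * soft_cross_entropy(article_logits, article_topics)
                + (1 - alpha) * soft_cross_entropy(caption_logits, caption_topics))

Because the targets come for free from the pairing of Wikipedia images with their articles and captions, no manual labels are needed; after training, the backbone's pooled features can be reused for classification or projected into a joint space for cross-modal retrieval.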