Detecting relations among objects is a crucial task for image understanding. However, each relationship involves a different pair of objects, and different object pairs express diverse interactions, which makes detecting relationships from visual features alone a challenging task. In this paper, we propose a simple yet effective relationship detection model based on object semantic inference and attention mechanisms. Our model is trained to detect relation triples of the form <subject, predicate, object>. To overcome the high diversity of visual appearances, the semantic inference module and the visual features are combined to complement each other. We also introduce two different attention mechanisms, for object feature refinement and phrase feature refinement. To derive a more detailed and comprehensive representation for each object, the object feature refinement module refines the representation of each object by querying over all the other objects in the image. The phrase feature refinement module makes the phrase feature more effective by automatically focusing on the relevant parts, improving the visual relationship detection task. We validate our model on the Visual Genome relationship dataset, where it achieves competitive results compared to the state-of-the-art method MOTIFNET.
{"title":"Relationship Detection Based on Object Semantic Inference and Attention Mechanisms","authors":"Liang Zhang, Shuai Zhang, Peiyi Shen, Guangming Zhu, Syed Afaq Ali Shah, Bennamoun","doi":"10.1145/3323873.3325025","DOIUrl":"https://doi.org/10.1145/3323873.3325025","url":null,"abstract":"Detecting relations among objects is a crucial task for image understanding. However, each relationship involves different objects pair combinations, and different objects pair combinations express diverse interactions. This makes the relationships, based just on visual features, a challenging task. In this paper, we propose a simple yet effective relationship detection model, which is based on object semantic inference and attention mechanisms. Our model is trained to detect relation triples, such as , . To overcome the high diversity of visual appearances, the semantic inference module and the visual features are combined to complement each others. We also introduce two different attention mechanisms for object feature refinement and phrase feature refinement. In order to derive a more detailed and comprehensive representation for each object, the object feature refinement module refines the representation of each object by querying over all the other objects in the image. The phrase feature refinement module is proposed in order to make the phrase feature more effective, and to automatically focus on relative parts, to improve the visual relationship detection task. We validate our model on Visual Genome Relationship dataset. Our proposed model achieves competitive results compared to the state-of-the-art method MOTIFNET.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125771210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The regions of interest (ROIs) of medical images in content-based image retrieval (CBIR) systems often require content authentication and verification, because adversarial modification of a stored image could have a lethal effect on research, diagnostic outcomes, and the outcomes of some forensic investigations. In this work, both robust watermarking and fragile steganography were combined with image search features to design a medical image retrieval system that incorporates ROI integrity verification. Original ROI features were pre-computed, embedded into archival images, and utilised during retrieval for image integrity checks. The average global image PSNR was 38.36 dB while the ROI PSNR was maintained at an average of 46 dB, with all watermark search features retrieved at a zero bit error rate (BER), provided the attack on the image is not perceptible.
{"title":"Integrity Verification in Medical Image Retrieval Systems using Spread Spectrum Steganography","authors":"P. Eze, P. Udaya, R. Evans, Dongxi Liu","doi":"10.1145/3323873.3325020","DOIUrl":"https://doi.org/10.1145/3323873.3325020","url":null,"abstract":"The region of interest (ROI) of medical images in content-based image retrieval (CBIR) systems often require content authentication and verification. This is because adversarial modification of the stored image could have lethal effect on research, diagnostic outcome and the outcome of some forensic investigations. In this work, both robust watermarking and Fragile Steganography were combined with image search features to design a medical image retrieval system that incorporates ROI integrity verification. Original ROI features were pre-computed and embedded into archival images and utilised during retrieval for image integrity checks. The average global image PSNR was 38.36dB while the ROI PSNR was maintained at an average of 46dB with all watermark search features retrieved at zero bit error rate (BER) provided the attack on the image is not perceptible.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116823569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic art analysis aims to classify and retrieve artistic representations from a collection of images by using computer vision and machine learning techniques. In this work, we propose to enhance visual representations from neural networks with contextual artistic information. Whereas visual representations are able to capture information about the content and the style of an artwork, our proposed context-aware embeddings additionally encode relationships between different artistic attributes, such as author, school, or historical period. We design two different approaches for using context in automatic art analysis. In the first one, contextual data is obtained through a multi-task learning model, in which several attributes are trained together to find visual relationships between elements. In the second approach, context is obtained through an art-specific knowledge graph, which encodes relationships between artistic attributes. An exhaustive evaluation of both of our models on several art analysis problems, such as author identification, type classification, and cross-modal retrieval, shows that performance improves by up to 7.3% in art classification and 37.24% in retrieval when context-aware embeddings are used.
{"title":"Context-Aware Embeddings for Automatic Art Analysis","authors":"Noa García, B. Renoust, Yuta Nakashima","doi":"10.1145/3323873.3325028","DOIUrl":"https://doi.org/10.1145/3323873.3325028","url":null,"abstract":"Automatic art analysis aims to classify and retrieve artistic representations from a collection of images by using computer vision and machine learning techniques. In this work, we propose to enhance visual representations from neural networks with contextual artistic information. Whereas visual representations are able to capture information about the content and the style of an artwork, our proposed context-aware embeddings additionally encode relationships between different artistic attributes, such as author, school, or historical period. We design two different approaches for using context in automatic art analysis. In the first one, contextual data is obtained through a multi-task learning model, in which several attributes are trained together to find visual relationships between elements. In the second approach, context is obtained through an art-specific knowledge graph, which encodes relationships between artistic attributes. An exhaustive evaluation of both of our models in several art analysis problems, such as author identification, type classification, or cross-modal retrieval, show that performance is improved by up to 7.3% in art classification and 37.24% in retrieval when context-aware embeddings are used.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133417345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimedia applications often require concurrent solutions to multiple tasks. These tasks hold clues to each other's solutions; however, because these relations can be complex, this property remains rarely utilized. When task relations are explicitly defined based on domain knowledge, multi-task learning (MTL) offers such concurrent solutions while exploiting the relatedness between multiple tasks performed over the same dataset. In most cases, however, this relatedness is not explicitly defined, and the domain expert knowledge that would define it is not available. To address this issue, we introduce Selective Sharing, a method that learns inter-task relatedness from secondary latent features while the model trains. Using this insight, we can automatically group tasks and allow them to share knowledge in a mutually beneficial way. We support our method with experiments on five datasets covering classification, regression, and ranking tasks, comparing against strong baselines and state-of-the-art approaches and showing a consistent improvement in terms of accuracy and parameter count. In addition, we perform an activation region analysis showing how Selective Sharing affects the learned representation.
{"title":"Learning Task Relatedness in Multi-Task Learning for Images in Context","authors":"Gjorgji Strezoski, N. V. Noord, M. Worring","doi":"10.1145/3323873.3325009","DOIUrl":"https://doi.org/10.1145/3323873.3325009","url":null,"abstract":"Multimedia applications often require concurrent solutions to multiple tasks. These tasks hold clues to each-others solutions, however as these relations can be complex this remains a rarely utilized property. When task relations are explicitly defined based on domain knowledge multi-task learning (MTL) offers such concurrent solutions, while exploiting relatedness between multiple tasks performed over the same dataset. In most cases however, this relatedness is not explicitly defined and the domain expert knowledge that defines it is not available. To address this issue, we introduce Selective Sharing, a method that learns the inter-task relatedness from secondary latent features while the model trains. Using this insight, we can automatically group tasks and allow them to share knowledge in a mutually beneficial way. We support our method with experiments on 5 datasets in classification, regression, and ranking tasks and compare to strong baselines and state-of-the-art approaches showing a consistent improvement in terms of accuracy and parameter counts. In addition, we perform an activation region analysis showing how Selective Sharing affects the learned representation.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125372313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, deep-network-based hashing has become a leading approach for large-scale image retrieval. Most deep hashing approaches use the high layers to extract powerful semantic representations. However, these methods have limited ability for fine-grained image retrieval, because the semantic features extracted from the high layers struggle to capture subtle differences. To this end, we propose a novel two-pyramid hashing architecture that learns both the semantic information and the subtle appearance details for fine-grained image search. Inspired by the feature pyramids of convolutional neural networks, a vertical pyramid is proposed to capture the high-layer features, and a horizontal pyramid combines multiple low-layer features with structural information to capture the subtle differences. To fuse the low-level features, a novel combination strategy, called consensus fusion, is proposed to capture all subtle information from several low layers for finer retrieval. Extensive evaluation on two fine-grained datasets, CUB-200-2011 and Stanford Dogs, demonstrates that the proposed method achieves significant improvements over the state-of-the-art baselines.
{"title":"Feature Pyramid Hashing","authors":"Yifan Yang, Libing Geng, Hanjiang Lai, Yan Pan, Jian Yin","doi":"10.1145/3323873.3325015","DOIUrl":"https://doi.org/10.1145/3323873.3325015","url":null,"abstract":"In recent years, deep-networks-based hashing has become a leading approach for large-scale image retrieval. Most deep hashing approaches use the high layer to extract the powerful semantic representations. However, these methods have limited ability for fine-grained image retrieval because the semantic features extracted from the high layer are difficult in capturing the subtle differences. To this end, we propose a novel two-pyramid hashing architecture to learn both the semantic information and the subtle appearance details for fine-grained image search. Inspired by the feature pyramids of convolutional neural network, avertical pyramid is proposed to capture the high-layer features and ahorizontal pyramid combines multiple low-layer features with structural information to capture the subtle differences. To fuse the low-level features, a novel combination strategy, called consensus fusion, is proposed to capture all subtle information from several low-layers for finer retrieval. Extensive evaluation on two fine-grained datasets CUB-200-2011 and Stanford Dogs demonstrate that the proposed method achieves significant performance compared with the state-of-art baselines.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"200 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122146673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep-network-based hashing has become a leading approach for large-scale image retrieval; it learns a similarity-preserving network that maps similar images to nearby hash codes. The pairwise and triplet losses are two widely used similarity-preserving schemes for deep hashing. These schemes ignore the fact that hashing is a prediction task over a list of binary codes. However, learning deep hashing with listwise supervision is challenging: 1) how can the rank list over the whole training set be obtained when the batch size of a deep network is always small, and 2) how can the listwise supervision be utilized? In this paper, we present a novel deep policy hashing architecture in which two systems are learned in parallel: a query network, and a shared, slowly changing database network. The following three steps are repeated until convergence: 1) the database network encodes all training samples into binary codes to obtain the whole rank list, 2) the query network is trained with policy learning to maximize a reward that reflects the performance of the whole ranking list of binary codes, e.g., mean average precision (MAP), and 3) the database network is updated to match the query network. Extensive evaluations on several benchmark datasets show that the proposed method brings substantial improvements over state-of-the-art hashing methods.
{"title":"Deep Policy Hashing Network with Listwise Supervision","authors":"Shaoying Wang, Hai-Cheng Lai, Yifan Yang, Jian Yin","doi":"10.1145/3323873.3325016","DOIUrl":"https://doi.org/10.1145/3323873.3325016","url":null,"abstract":"Deep-networks-based hashing has become a leading approach for large-scale image retrieval, which learns a similarity-preserving network to map similar images to nearby hash codes. The pairwise and triplet losses are two widely used similarity preserving manners for deep hashing. These manners ignore the fact that hashing is a prediction task on the list of binary codes. However, learning deep hashing with listwise supervision is challenging in 1) how to obtain the rank list of whole training set when the batch size of the deep network is always small and 2) how to utilize the listwise supervision. In this paper, we present a novel deep policy hashing architecture with two systems are learned in parallel: aquery network and a shared and slowly changingdatabase network. The following three steps are repeated until convergence: 1) the database network encodes all training samples into binary codes to obtain whole rank list, 2) the query network is trained based on policy learning to maximize a reward that indicates the performance of the whole ranking list of binary codes, e.g., mean average precision (MAP), and 3) the database network is updated as the query network. Extensive evaluations on several benchmark datasets show that the proposed method brings substantial improvements over state-of-the-art hashing methods.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127358641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose an unsupervised hashing method, exploiting a shallow neural network, that aims to produce binary codes that preserve the ranking induced by an original real-valued representation. This is motivated by the emergence of small-world graph-based approximate search methods that rely on local neighborhood ranking. We formalize the training process in an intuitive way by considering each training sample as a query and aiming to obtain a ranking of a random subset of the training set using the hash codes that is the same as the ranking using the original features. We also explore the use of a decoder to obtain an approximated reconstruction of the original features. At test time, we retrieve the most promising database samples using only the hash codes and perform re-ranking using the reconstructed features, thus allowing the complete elimination of the original real-valued features and the associated high memory cost. Experiments conducted on publicly available large-scale datasets show that our method consistently outperforms all compared state-of-the-art unsupervised hashing methods and that the reconstruction procedure can effectively boost the search accuracy with a minimal constant additional cost.
{"title":"Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval","authors":"Svebor Karaman, Xudong Lin, Xuefeng Hu, Shih-Fu Chang","doi":"10.1145/3323873.3325038","DOIUrl":"https://doi.org/10.1145/3323873.3325038","url":null,"abstract":"We propose an unsupervised hashing method, exploiting a shallow neural network, that aims to produce binary codes that preserve the ranking induced by an original real-valued representation. This is motivated by the emergence of small-world graph-based approximate search methods that rely on local neighborhood ranking. We formalize the training process in an intuitive way by considering each training sample as a query and aiming to obtain a ranking of a random subset of the training set using the hash codes that is the same as the ranking using the original features. We also explore the use of a decoder to obtain an approximated reconstruction of the original features. At test time, we retrieve the most promising database samples using only the hash codes and perform re-ranking using the reconstructed features, thus allowing the complete elimination of the original real-valued features and the associated high memory cost. Experiments conducted on publicly available large-scale datasets show that our method consistently outperforms all compared state-of-the-art unsupervised hashing methods and that the reconstruction procedure can effectively boost the search accuracy with a minimal constant additional cost.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"06 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128926054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, cross-modal deep hashing has received broad attention for solving cross-modal retrieval problems efficiently. Most cross-modal hashing methods generate $O(n^2)$ data pairs and $O(n^3)$ data triplets for training, but the training procedure is inefficient because this complexity is prohibitive for large-scale datasets. In this paper, we propose a novel and efficient cross-modal hashing algorithm named Joint Cluster Cross-Modal Hashing (JCCH). First, we introduce the Cross-Modal Unary Loss (CMUL), with $O(n)$ complexity, to bridge the traditional triplet loss and classification-based unary losses, and the JCCH algorithm is built on CMUL. Second, a more accurate bound of the triplet loss for structured multilabel data is introduced in CMUL. The resulting hash codes form several clusters in which the codes in the same cluster share similar semantic information, and the heterogeneity gap between modalities is diminished by sharing the clusters. Experiments on large-scale datasets show that the proposed method is superior to, or comparable with, state-of-the-art cross-modal hashing methods, and training with the proposed method is more efficient.
{"title":"Joint Cluster Unary Loss for Efficient Cross-Modal Hashing","authors":"Shifeng Zhang, Jianmin Li, Bo Zhang","doi":"10.1145/3323873.3325059","DOIUrl":"https://doi.org/10.1145/3323873.3325059","url":null,"abstract":"Recently, cross-modal deep hashing has received broad attention for solving cross-modal retrieval problems efficiently. Most cross-modal hashing methods generate $O(n^2)$ data pairs and $O(n^3)$ data triplets for training, but the training procedure is less efficient because the complexity is high for large-scale dataset. In this paper, we propose a novel and efficient cross-modal hashing algorithm named Joint Cluster Cross-Modal Hashing (JCCH). First, We introduce the Cross-Modal Unary Loss (CMUL) with $O(n)$ complexity to bridge the traditional triplet loss and classification-based unary loss, and the JCCH algorithm is introduced with CMUL. Second, a more accurate bound of the triplet loss for structured multilabel data is introduced in CMUL. The resultant hashcodes form several clusters in which the hashcodes in the same cluster share similar semantic information, and the heterogeneity gap on different modalities is diminished by sharing the clusters. Experiments on large-scale datasets show that the proposed method is superior over or comparable with state-of-the-art cross-modal hashing methods, and training with the proposed method is more efficient than others.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114291153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-modal retrieval methods have improved significantly in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort, and their annotations are limited to discrete sets of popular visual classes that may not be representative of the richer semantics found in large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text across the entire set of Wikipedia articles. Our method consists of training a CNN to predict: (1) the semantic context of the article in which an image is most likely to appear as an illustration, and (2) the semantic context of its caption. Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for vision tasks like classification, but also that the learned representations are better for cross-modal retrieval than supervised pre-training of the network on the ImageNet dataset.
{"title":"Self-Supervised Visual Representations for Cross-Modal Retrieval","authors":"Yash J. Patel, Lluís Gómez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar","doi":"10.1145/3323873.3325035","DOIUrl":"https://doi.org/10.1145/3323873.3325035","url":null,"abstract":"Cross-modal retrieval methods have been significantly improved in last years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists in training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration, and (2) the semantic context of its caption. Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like classification, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129827550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An adversarial query is an image that has been modified to disrupt content-based image retrieval (CBIR), while appearing nearly untouched to the human eye. This paper presents an analysis of adversarial queries for CBIR based on neural, local, and global features. We introduce an innovative neural image perturbation approach, called Perturbations for Image Retrieval Error (PIRE), that is capable of blocking neural-feature-based CBIR. PIRE differs significantly from existing approaches that create images adversarial with respect to CNN classifiers because it is unsupervised, i.e., it needs no labeled data from the data set to which it is applied. Our experimental analysis demonstrates the surprising effectiveness of PIRE in blocking CBIR, and also covers aspects of PIRE that must be taken into account in practical settings, including saving images, image quality and leaking adversarial queries into the background collection. Our experiments also compare PIRE (a neural approach) with existing keypoint removal and injection approaches (which modify local features). Finally, we discuss the challenges that face multimedia researchers in the future study of adversarial queries.
{"title":"Who's Afraid of Adversarial Queries?: The Impact of Image Modifications on Content-based Image Retrieval","authors":"Zhuoran Liu, Zhengyu Zhao, M. Larson","doi":"10.1145/3323873.3325052","DOIUrl":"https://doi.org/10.1145/3323873.3325052","url":null,"abstract":"An adversarial query is an image that has been modified to disrupt content-based image retrieval (CBIR), while appearing nearly untouched to the human eye. This paper presents an analysis of adversarial queries for CBIR based on neural, local, and global features. We introduce an innovative neural image perturbation approach, called Perturbations for Image Retrieval Error (PIRE), that is capable of blocking neural-feature-based CBIR. PIRE differs significantly from existing approaches that create images adversarial with respect to CNN classifiers because it is unsupervised, i.e., it needs no labeled data from the data set to which it is applied. Our experimental analysis demonstrates the surprising effectiveness of PIRE in blocking CBIR, and also covers aspects of PIRE that must be taken into account in practical settings, including saving images, image quality and leaking adversarial queries into the background collection. Our experiments also compare PIRE (a neural approach) with existing keypoint removal and injection approaches (which modify local features). Finally, we discuss the challenges that face multimedia researchers in the future study of adversarial queries.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126159623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}