Hierarchical Attention based Neural Network for Explainable Recommendation
Dawei Cong, Yanyan Zhao, Bing Qin, Yu Han, Murray Zhang, Alden Liu, Nat Chen
DOI: https://doi.org/10.1145/3323873.3326592
In recent years, recommendation systems have attracted increasing attention due to the rapid development of e-commerce. Review information can help in modeling users' preferences and items' performance, and some existing methods already utilize reviews for recommendation. However, few of those models consider the importance of reviews and of the words in the corpus together. We therefore propose an approach for rating prediction using a hierarchical attention-based neural network, named HANN, which automatically distinguishes importance at both the word level and the review level and thereby supports explanations. Experiments on four real-life Amazon datasets demonstrate that our model improves prediction accuracy compared to several state-of-the-art approaches. The hierarchical attention weights on sampled test data confirm the model's ability to select informative words and reviews.
High-Capacity Convolutional Video Steganography with Temporal Residual Modeling
Xinyu Weng, Yongzhi Li, Lu Chi, Yadong Mu
DOI: https://doi.org/10.1145/3323873.3325011
Steganography is the art of unobtrusively concealing a secret message within some cover data. The key scope of this work is high-capacity visual steganography that hides a full-sized color video within another video. We empirically show that high-capacity image steganography models do not naturally extend to the video case, because they completely ignore the temporal redundancy between consecutive video frames. Our work proposes a novel solution to this problem (i.e., hiding one video inside another). The technical contributions are two-fold. First, motivated by the fact that the residual between two consecutive frames is highly sparse, we propose to explicitly model inter-frame residuals. Specifically, our model contains two branches: one is specially designed to hide the inter-frame residual in a cover video frame, while the other hides the original secret frame. Two decoders are then devised to reveal the residual and the frame, respectively. Second, we build the model on deep convolutional neural networks, which is the first of its kind in the video steganography literature. In experiments, comprehensive evaluations compare our model with classic steganography methods and purely high-capacity image steganography models. All results strongly suggest that the proposed model enjoys advantages over previous methods. We also carefully investigate the model's security against steganalyzers and its robustness to video compression.
Interactive Video Retrieval in the Age of Deep Learning
Jakub Lokoč, Klaus Schöffmann, W. Bailer, Luca Rossetto, C. Gurrin
DOI: https://doi.org/10.1145/3323873.3326588
We present a tutorial focusing on video retrieval tasks in which state-of-the-art deep learning approaches still benefit from interactive decisions by users. The tutorial covers a general introduction to the interactive video retrieval research area, state-of-the-art video retrieval systems, evaluation campaigns, and recently observed results. Moreover, a significant part of the tutorial is dedicated to a practical exercise with three selected state-of-the-art systems in the form of an interactive video retrieval competition. Participants will gain practical experience and a general insight into interactive video retrieval, which is a good starting point for focusing their research on unsolved challenges in this area.
DietLens-Eout: Large Scale Restaurant Food Photo Recognition
Zhipeng Wei, Jingjing Chen, Zhaoyan Ming, C. Ngo, Tat-Seng Chua, F. Zhou
DOI: https://doi.org/10.1145/3323873.3326923
Restaurant dishes represent a significant portion of the food that people consume in their daily life. As people become more health-conscious about their food intake, convenient restaurant food tracking becomes an essential task in wellness and fitness applications. Given the huge number of dishes (food categories) involved, it is extremely challenging to make traditional food photo classification feasible, in terms of both algorithm design and training data availability. In this work, we present a demo that runs on restaurant dish images in a city with millions of residents and tens of thousands of restaurants. We propose a rank-loss-based convolutional neural network to optimize the image feature representation. Context information, such as the GPS location of the recognition request, is also used to further improve performance. Our experimental results are highly promising, and the demo shows that the proposed algorithm is nearly ready to be deployed in real-world applications.
A Geographical-Temporal Awareness Hierarchical Attention Network for Next Point-of-Interest Recommendation
Tongcun Liu, J. Liao, Zhigen Wu, Yulong Wang, Jingyu Wang
DOI: https://doi.org/10.1145/3323873.3325024
Obtaining insight into user mobility for next point-of-interest (POI) recommendation is a vital yet challenging task in location-based social networking. Information is needed not only to estimate user preferences but also to leverage sequential relationships among user check-ins. Existing approaches to understanding user mobility gloss over the check-in sequence, making it difficult to capture subtle POI-POI connections and to distinguish relevant check-ins from irrelevant ones. We created a geographical-temporal awareness hierarchical attention network (GT-HAN) to resolve those issues. GT-HAN contains an extended attention network that uses a theory of geographical influence to simultaneously uncover the overall sequence dependence and the subtle POI-POI relationships. We show that mining subtle POI-POI relationships significantly improves the quality of next-POI recommendations. A context-specific co-attention network was designed to learn changing user preferences by adaptively selecting relevant check-in activities from check-in histories, which enables GT-HAN to distinguish degrees of user preference across different check-ins. Tests on two large-scale datasets (from Foursquare and Gowalla) demonstrated the superiority of GT-HAN over existing approaches, achieving excellent results.
Cross-modal Collaborative Manifold Propagation for Image Recommendation
Meng Jian, Ting Jia, Xun Yang, Lifang Wu, Lina Huo
DOI: https://doi.org/10.1145/3323873.3325054
With the rapid evolution of social networks, the growing user intention gap and visual semantic gap make it challenging for users to access satisfying content, so investigating customized multimedia recommendation becomes promising. In this paper, we propose cross-modal collaborative manifold propagation (CMP) for image recommendation. CMP leverages users' interest distributions to propagate images' user records, which lets users learn trends from others and produces interest-aware image candidates according to users' interests. The visual distribution is investigated simultaneously to propagate users' visual records along a dense semantic visual manifold. Visual manifold propagation helps to estimate semantically accurate user-image correlations for the candidate images during recommendation ranking. Experimental results demonstrate the collaborative user-image inference ability of CMP, with effective user interest manifold propagation and semantic visual manifold propagation, in personalized image recommendation.
Understanding, Categorizing and Predicting Semantic Image-Text Relations
Christian Otto, Matthias Springstein, Avishek Anand, R. Ewerth
DOI: https://doi.org/10.1145/3323873.3325049
Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images, as well as of their interplay, has great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but they typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and, inspired by research in visual communication, investigate useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can be systematically characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system that predicts these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
Triplet Fusion Network Hashing for Unpaired Cross-Modal Retrieval
Zhikai Hu, Xin Liu, Xingzhi Wang, Yiu-ming Cheung, N. Wang, Yewang Chen
DOI: https://doi.org/10.1145/3323873.3325041
With the dramatic increase of multimedia data on the Internet, cross-modal retrieval has become an important and valuable task for search systems. The key challenge of this task is how to build the correlation between multi-modal data. Most existing approaches focus only on paired data, using the pairwise relationships of multi-modal data to explore the correlation between modalities. In practice, however, unpaired data are more common on the Internet, and few methods pay attention to them. To utilize both paired and unpaired data, we propose a one-stream framework, triplet fusion network hashing (TFNH), which mainly consists of two parts. The first part is a triplet network that handles both kinds of data with the help of a zero-padding operation. The second part consists of two data classifiers, which bridge the gap between paired and unpaired data. In addition, we embed manifold learning into the framework to preserve both inter- and intra-modal similarity, to explore the relationship between unpaired and paired data, and to bridge the gap between them during learning. Extensive experiments show that the proposed approach outperforms several state-of-the-art methods on two datasets in the paired scenario. We further evaluate its ability to handle the unpaired scenario and its robustness with regard to the pairwise constraint. The results show that even when we discard 50% of the data under the setting in [19], TFNH still performs better than other unpaired approaches, and that even when only 70% of the pairwise relationships are preserved, TFNH can still outperform almost all paired approaches.
V3C1 Dataset: An Evaluation of Content Characteristics
Fabian Berns, Luca Rossetto, Klaus Schöffmann, C. Beecks, G. Awad
DOI: https://doi.org/10.1145/3323873.3325051
In this work we analyze content statistics of the V3C1 dataset, the first partition of the Vimeo Creative Commons Collection (V3C). The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content characteristics, and will serve as the evaluation basis for the Video Browser Showdown 2019-2021 and the TREC Video Retrieval Evaluation (TRECVID) Ad-Hoc Video Search tasks 2019-2021. The dataset comes with a shot segmentation (around 1 million shots), for which we analyze content specifics and statistics. Our analysis shows that the content of V3C1 is very diverse, has no predominant characteristics, and exhibits low self-similarity. It is therefore very well suited for video retrieval evaluations, as well as for participants of TRECVID AVS or the VBS.
Learning Discriminative Features for Image Retrieval
Yinghao Wang, Chen Chen, Jiong Wang, Yingying Zhu
DOI: https://doi.org/10.1145/3323873.3325032
Discriminative local features obtained from the activations of convolutional neural networks have proven to be essential for image retrieval. To improve retrieval performance, many recent works aim to obtain more powerful and discriminative features. In this work, we propose a new attention layer that assesses the importance of local features and assigns higher weights to the more discriminative ones. Furthermore, we present a scale-and-mask module that filters out meaningless local features and scales the major components. This module not only reduces the impact of the varying scales of the major components in images by scaling them on the feature maps, but also filters out redundant and confusing features with the MAX-Mask. Finally, the features are aggregated into the image representation. Experimental evaluations demonstrate that the proposed method outperforms state-of-the-art methods on standard image retrieval datasets.