{"title":"Local Self-Attention on Fine-grained Cross-media Retrieval","authors":"Chen Wang, Yazhou Yao, Qiong Wang, Zhenmin Tang","doi":"10.1145/3469877.3490590","DOIUrl":null,"url":null,"abstract":"Due to the heterogeneity gap, the data representation of different media is inconsistent and belongs to different feature spaces. Therefore, it is challenging to measure the fine-grained gap between them. To this end, we propose an attention space training method to learn common representations of different media data. Specifically, we utilize local self-attention layers to learn the common attention space between different media data. We propose a similarity concatenation method to understand the content relationship between features. To further improve the robustness of the model, we also train a local position encoding to capture the spatial relationships between features. In this way, our proposed method can effectively reduce the gap between different feature distributions on cross-media retrieval tasks. It also improves the fine-grained recognition performance by attaching attention to high-level semantic information. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. At the same time, our approach provides a new pipeline for fine-grained cross-media retrieval. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Multimedia Asia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3469877.3490590","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Due to the heterogeneity gap, the data representations of different media are inconsistent and lie in different feature spaces, making it challenging to measure the fine-grained gap between them. To this end, we propose an attention-space training method to learn common representations of different media data. Specifically, we utilize local self-attention layers to learn a common attention space across different media data. We propose a similarity concatenation method to capture the content relationships between features. To further improve the robustness of the model, we also train a local position encoding to capture the spatial relationships between features. In this way, our proposed method effectively reduces the gap between different feature distributions on cross-media retrieval tasks. It also improves fine-grained recognition performance by attending to high-level semantic information. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. At the same time, our approach provides a new pipeline for fine-grained cross-media retrieval. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.
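To make the core idea concrete, the sketch below shows a minimal local self-attention layer with a learned local (relative) position bias in PyTorch. It is an illustration only, assuming a window-based formulation and treating media items as sequences of region/frame features; the class name, window size, and the exact form of the position encoding are assumptions, not the paper's implementation, and the proposed similarity concatenation step is not modeled here since the abstract does not specify its form.

```python
# Illustrative sketch (assumed design, not the authors' released code):
# local self-attention over a sequence of region/frame features, with a
# learned relative-position bias restricted to a local window.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSelfAttention(nn.Module):
    def __init__(self, dim: int, window: int = 7):
        super().__init__()
        self.window = window  # size of the local neighbourhood each feature attends to
        self.qkv = nn.Linear(dim, 3 * dim)
        # Learned local position encoding: one bias per relative offset
        # inside the window (an assumed form of a "local position encoding").
        self.rel_pos_bias = nn.Parameter(torch.zeros(2 * window - 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. image region features or audio/video frame features
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Content similarity between all pairs of positions.
        attn = (q @ k.transpose(-2, -1)) / d ** 0.5  # (b, n, n)

        # Relative offsets between positions; mask keeps only the local window.
        idx = torch.arange(n, device=x.device)
        rel = idx[None, :] - idx[:, None]                       # (n, n)
        mask = rel.abs() < self.window
        rel_clamped = rel.clamp(-self.window + 1, self.window - 1)

        # Add the learned local position bias, then restrict attention to the window.
        attn = attn + self.rel_pos_bias[rel_clamped + self.window - 1]
        attn = attn.masked_fill(~mask, float("-inf"))
        attn = F.softmax(attn, dim=-1)

        return self.proj(attn @ v)


if __name__ == "__main__":
    layer = LocalSelfAttention(dim=256, window=5)
    feats = torch.randn(2, 49, 256)    # e.g. a 7x7 grid of region features
    print(layer(feats).shape)          # torch.Size([2, 49, 256])
```

In this reading, restricting attention to a local window encourages the layer to relate nearby fine-grained parts, while the shared attention parameters across media types are what would place different modalities in a common attention space; both points are interpretations of the abstract rather than confirmed implementation details.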