Correlated Features Synthesis and Alignment for Zero-shot Cross-modal Retrieval

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. Pub Date: 2020-07-25. DOI: 10.1145/3397271.3401149

Xing Xu, Kaiyi Lin, Huimin Lu, Lianli Gao, Heng Tao Shen


Citations: 15

Abstract

The goal of cross-modal retrieval is to search for semantically similar instances in one modality using a query from another modality. Existing approaches mainly consider the standard scenario, in which the source set used for training and the target set used for testing share the same set of classes. However, they may not generalize well to the zero-shot cross-modal retrieval (ZS-CMR) task, where the target set contains unseen classes that are disjoint from the seen classes in the source set. This task is more challenging due to 1) the absence of the unseen classes during training, 2) inconsistent semantics across seen and unseen classes, and 3) the heterogeneous multimodal distributions of the source and target sets. To address these issues, we propose a novel Correlated Feature Synthesis and Alignment (CFSA) approach that integrates multimodal feature synthesis, common space learning, and knowledge transfer for ZS-CMR. CFSA first uses class-level word embeddings to guide two coupled Wasserstein generative adversarial networks (WGANs) to synthesize sufficient multimodal features with semantic correlation for stable training. The synthetic and real multimodal features are then jointly mapped to a common semantic space via a distribution alignment scheme, where the cross-modal correlations of different semantic features are captured and knowledge can be transferred to the unseen classes under a cycle-consistency constraint. Experiments on four benchmark datasets for image-text retrieval and two large-scale datasets for image-sketch retrieval show remarkable improvements of our CFSA method over a range of state-of-the-art approaches.
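The synthesis step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single-layer linear "networks", the dimensions, the sharing of one noise vector between the two generators as a stand-in for "coupling", and the embedding-reconstruction form of the cycle-consistency term are all assumptions made for illustration. It only shows how a class-level word embedding can condition two feature generators and how a Wasserstein critic objective and a cycle term are evaluated; no training loop, gradient penalty, or distribution alignment is included.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, NOISE, IMG, TXT, BATCH = 5, 3, 8, 6, 4  # illustrative sizes, not from the paper


def init(in_dim, out_dim):
    """Hypothetical helper: random weights for one linear layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))


G_img = init(EMB + NOISE, IMG)   # image-feature generator (assumed form)
G_txt = init(EMB + NOISE, TXT)   # text-feature generator (assumed form)
R_img = init(IMG, EMB)           # regressor back to the embedding space (assumed)
critic_img = init(IMG, 1)        # linear critic for image features (assumed)


def generate(G, class_emb, noise):
    """Conditional generator: map [class embedding, noise] to a feature vector."""
    return np.tanh(np.concatenate([class_emb, noise], axis=1) @ G)


def critic_loss(critic, real, fake):
    """WGAN critic objective to minimize: E[f(fake)] - E[f(real)]."""
    return float((fake @ critic).mean() - (real @ critic).mean())


def cycle_loss(R, fake, class_emb):
    """Assumed cycle term: reconstruct the class embedding from synthetic features."""
    return float(((fake @ R - class_emb) ** 2).mean())


class_emb = rng.normal(size=(BATCH, EMB))   # stand-in for class-level word embeddings
noise = rng.normal(size=(BATCH, NOISE))     # shared by both generators (assumption)
real_img = rng.normal(size=(BATCH, IMG))    # stand-in for real image features

fake_img = generate(G_img, class_emb, noise)
fake_txt = generate(G_txt, class_emb, noise)

loss_d = critic_loss(critic_img, real_img, fake_img)
loss_c = cycle_loss(R_img, fake_img, class_emb)
```

Because both generators receive the same class embedding, the synthetic image and text features for a given row describe the same class, which is the minimal sense in which they are semantically correlated here.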

