Coupled dictionary learning and feature mapping for cross-modal retrieval

2015 IEEE International Conference on Multimedia and Expo (ICME). Pub Date: 2015-08-06. DOI: 10.1109/ICME.2015.7177396

Xing Xu, Atsushi Shimada, R. Taniguchi, Li He


Citations: 32

Abstract

In this paper, we investigate the problem of modeling images and associated text for cross-modal retrieval tasks such as text-to-image search and image-to-text search. To make data from the image and text modalities comparable, previous cross-modal retrieval methods directly learn two projection matrices that map the raw features of the two modalities into a common subspace, in which cross-modal matching can be performed. However, the modalities differ in their feature representations and correlation structures, which prevents these methods from efficiently modeling cross-modal relationships through a common subspace. To handle this diversity, we first use coupled dictionary learning to generate homogeneous sparse representations for the different modalities by associating and jointly updating their dictionaries. We then use a coupled feature mapping scheme to project the derived sparse representations into a common subspace in which cross-modal retrieval can be performed. Experiments on a variety of cross-modal retrieval tasks demonstrate that the proposed method outperforms state-of-the-art approaches.
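The final stage described in the abstract, projecting per-modality sparse codes into a common subspace and matching there, can be illustrated with a minimal sketch. This is not the authors' implementation: the sparse codes `A_img` and `A_txt`, the projection matrices `W_img` and `W_txt`, and the use of cosine similarity for ranking are all hypothetical toy choices made for illustration. In the paper, the codes would come from coupled dictionary learning and the projections from the coupled feature mapping step.

```python
import math

def project(codes, W):
    """Map each row of `codes` (n x k) into the common subspace via W (k x c)."""
    return [[sum(a * w for a, w in zip(row, col)) for col in zip(*W)]
            for row in codes]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def rank(query, gallery):
    """Return gallery indices ordered by decreasing similarity to the query."""
    return sorted(range(len(gallery)), key=lambda i: -cosine(query, gallery[i]))

# Hypothetical sparse codes, one row per item (assumed, not from the paper).
A_img = [[0.9, 0.0, 0.0], [0.0, 0.0, 0.8], [0.5, 0.5, 0.0]]  # 3 images, 3 atoms
A_txt = [[0.0, 1.0], [0.6, 0.0]]                             # 2 texts, 2 atoms

# Hypothetical projection matrices into a 2-dimensional common subspace.
W_img = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # 3 x 2
W_txt = [[0.0, 1.0], [1.0, 0.0]]              # 2 x 2

P_img = project(A_img, W_img)
P_txt = project(A_txt, W_txt)

# Text-to-image search: rank all images for the first text query.
print(rank(P_txt[0], P_img))
# Image-to-text search: rank all texts for the first image query.
print(rank(P_img[0], P_txt))
```

Because both modalities end up in the same 2-dimensional space, the same `rank` function serves both retrieval directions; only the roles of query and gallery are swapped.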
