CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval
Yaoxin Zhuo, Yikang Li, Jenhao Hsiao, C. Ho, Baoxin Li
Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27
DOI: 10.1145/3512527.3531381
Citations: 9
Abstract
With the ever-increasing volume of multimedia data on the Web, cross-modal video-text retrieval has attracted considerable attention in recent years. Deep cross-modal hashing approaches exploit the Hamming space to achieve fast retrieval. However, most existing algorithms have difficulty constructing a well-defined joint semantic space. In this paper, an unsupervised deep cross-modal video-text hashing approach, CLIP4Hashing, is proposed, which mitigates the difficulty of bridging different modalities in the Hamming space by building a single hashing network on top of the pre-trained CLIP model. The approach is enhanced by two novel techniques, a dynamic weighting strategy and a min-max hashing layer, which are found to be the main sources of the performance gain. Unlike conventional deep cross-modal hashing algorithms, CLIP4Hashing requires no data-specific hyper-parameters. Evaluated on three challenging video-text benchmark datasets, CLIP4Hashing significantly outperforms existing state-of-the-art hashing algorithms. Moreover, with larger bit sizes (e.g., 2048 bits), it can even deliver performance competitive with results based on non-hashing features.
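The abstract does not spell out the min-max hashing layer or the retrieval procedure. As a minimal illustrative sketch only (the function names, the per-sample normalization, and the 0.5 threshold are assumptions, not the paper's exact formulation), one plausible reading is min-max scaling of each feature vector followed by thresholding into bits, with ranking by Hamming distance:

```python
import numpy as np

def min_max_binarize(features, eps=1e-8):
    """Illustrative min-max hashing step (assumed form): scale each
    feature vector to [0, 1] with per-sample min-max normalization,
    then threshold at 0.5 to obtain binary hash codes."""
    lo = features.min(axis=1, keepdims=True)
    hi = features.max(axis=1, keepdims=True)
    scaled = (features - lo) / (hi - lo + eps)
    return (scaled >= 0.5).astype(np.uint8)

def hamming_retrieve(query_code, gallery_codes):
    """Rank gallery items by Hamming distance to the query code
    (number of differing bits); smaller distance ranks first."""
    dists = np.count_nonzero(gallery_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")

# Toy example: hash 4 gallery embeddings and 1 query into 8-bit codes.
rng = np.random.default_rng(0)
gallery = min_max_binarize(rng.standard_normal((4, 8)))
query = min_max_binarize(rng.standard_normal((1, 8)))[0]
ranking = hamming_retrieve(query, gallery)
```

In a real system the codes would come from the shared CLIP-based hashing network for both video and text, so that cross-modal comparison reduces to cheap XOR-and-popcount operations in the Hamming space.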