TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval

Yongbiao Chen, Shenmin Zhang, Fangxin Liu, Zhigang Chang, Mang Ye, Zhengwei Qi
Shanghai Jiao Tong University, U. California, W. University
Published in: Proceedings of the 2022 International Conference on Multimedia Retrieval
DOI: 10.1145/3512527.3531405 (https://doi.org/10.1145/3512527.3531405)
Publication date: 2021-05-05
Citations: 22

Abstract

Deep hashing has gained growing popularity in approximate nearest neighbor search for large-scale image retrieval. Until now, deep hashing for image retrieval has been dominated by convolutional neural network architectures, e.g., ResNet [22]. In this paper, inspired by recent advances in vision transformers, we present TransHash, a pure transformer-based framework for deep hashing learning. Concretely, our framework is composed of two major modules: (1) Based on the Vision Transformer (ViT), we design a Siamese Multi-Granular Vision Transformer backbone (MGVT) for image feature extraction. To learn fine-grained features, we introduce a dual-stream multi-granular feature learning scheme on top of the transformer that learns discriminative global and local features. (2) We adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to learn compact binary hash codes. The entire framework is jointly trained in an end-to-end manner. To the best of our knowledge, this is the first work to tackle deep hashing learning problems without convolutional neural networks (CNNs). We perform comprehensive experiments on three widely studied datasets: CIFAR-10, NUS-WIDE, and ImageNet. The experiments show that TransHash outperforms existing state-of-the-art deep hashing methods. Specifically, averaged over different hash bit lengths, we achieve mAP gains of 8.2%, 2.6%, and 12.7% on the three datasets, respectively.
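
The abstract does not spell out how binary codes are compared or how mAP is computed. As a rough illustration only, the sketch below follows common deep-hashing conventions rather than the authors' implementation: real-valued features are binarized with the sign function, database items are ranked by Hamming distance, a pairwise similarity matrix is built from labels, and mean average precision is computed over the ranking. All function names, the toy features, and the toy labels are hypothetical; the MGVT backbone and the Bayesian training objective are not reproduced here.

```python
import numpy as np


def binarize(features):
    """Map real-valued features to {-1, +1} hash codes with the sign function."""
    return np.where(features >= 0, 1, -1).astype(np.int8)


def hamming_distance(query_codes, db_codes):
    """Pairwise Hamming distances between {-1, +1} codes.

    For K-bit codes a and b, d_H(a, b) = (K - <a, b>) / 2.
    """
    k = query_codes.shape[1]
    inner = query_codes.astype(np.int32) @ db_codes.astype(np.int32).T
    return (k - inner) // 2


def similarity_matrix(labels_a, labels_b):
    """S[i, j] = 1 if samples i and j share at least one label, else 0.

    Labels are multi-hot rows, so this also covers multi-label data.
    """
    return (labels_a @ labels_b.T > 0).astype(np.int8)


def mean_average_precision(dist, relevance):
    """mAP over queries, ranking database items by ascending Hamming distance."""
    aps = []
    for d, rel in zip(dist, relevance):
        order = np.argsort(d, kind="stable")
        hits = rel[order]
        if hits.sum() == 0:
            continue  # skip queries with no relevant database item
        precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
        aps.append((precision_at_k * hits).sum() / hits.sum())
    return float(np.mean(aps)) if aps else 0.0


# Toy data (hypothetical): four database items and one query, 4-bit codes.
db_features = np.array([
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.6, -0.4, 0.2, 0.1],
    [-0.3, 0.7, -0.6, 0.5],
])
query_features = np.array([[0.8, -0.1, 0.3, -0.2]])

# One-hot class labels: items 0 and 2 belong to class 0, items 1 and 3 to class 1.
db_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
query_labels = np.array([[1, 0]])

db_codes = binarize(db_features)
query_codes = binarize(query_features)

dist = hamming_distance(query_codes, db_codes)
relevance = similarity_matrix(query_labels, db_labels)
map_score = mean_average_precision(dist, relevance)

print("Hamming distances:", dist.tolist())
print("Relevance:", relevance.tolist())
print("mAP:", map_score)
```

Computing Hamming distance through an inner product is a design choice that keeps the ranking step a single matrix multiplication; packed bit arrays with XOR and popcount would be an equally valid alternative for large databases.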