Pyramid transformer-based triplet hashing for robust visual place recognition

Computer Vision and Image Understanding · IF 3.5 · Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Pub Date: 2024-12-01 · Epub Date: 2024-09-06 · DOI: 10.1016/j.cviu.2024.104167

Zhenyu Li, Pengjie Xu

Journal article. Computer Vision and Image Understanding, Volume 249, Article 104167. https://www.sciencedirect.com/science/article/pii/S1077314224002480

Citations: 0

Abstract



Deep hashing is widely used for approximate nearest neighbor search in large-scale image recognition, but CNN architectures have dominated such applications. In this study, we present a Pyramid Transformer-based Triplet Hashing architecture that leverages the capabilities of the Vision Transformer (ViT) to handle large-scale place recognition. For feature representation, we build a Siamese Pyramid Transformer backbone, and we propose a multi-scale feature aggregation technique to learn discriminative, scale-invariant features. In addition, we observe that existing binary codes are sub-optimal for place recognition. To overcome this issue, we use a self-restraint triplet loss deep learning network to generate compact hash codes, which further increases recognition accuracy. To the best of our knowledge, this is the first study to use a triplet loss deep learning network to address the deep hashing learning problem. We conduct extensive experiments on four challenging place recognition datasets: KITTI, Nordland, VPRICE, and EuRoC. The experimental results show that the proposed technique achieves state-of-the-art performance on large-scale visual place recognition.
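The abstract does not give the loss or the retrieval procedure, so the following is only a minimal sketch of the general idea behind triplet-based hashing: real-valued features are relaxed with tanh during training, scored with a standard triplet margin loss, then thresholded into binary codes that are compared by Hamming distance. It does not reproduce the paper's self-restraint triplet loss or its Pyramid Transformer backbone. Every function name, the margin value, and the toy feature vectors are illustrative assumptions.

```python
import math


def binarize(features):
    """Threshold a real-valued feature vector at zero to get a binary code."""
    return [1 if x >= 0 else 0 for x in features]


def hamming_distance(code_a, code_b):
    """Count the bit positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss on tanh-relaxed codes (an assumption,
    not the paper's loss): max(0, d(a, p) - d(a, n) + margin)."""
    a = [math.tanh(x) for x in anchor]
    p = [math.tanh(x) for x in positive]
    n = [math.tanh(x) for x in negative]
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)


def nearest_place(query_code, database):
    """Return the label whose stored code is closest in Hamming distance."""
    return min(database, key=lambda label: hamming_distance(query_code, database[label]))


# Toy database of two places with hypothetical 4-dimensional features.
database = {
    "place_A": binarize([0.9, -0.4, 0.7, -0.2]),
    "place_B": binarize([-0.8, 0.5, -0.6, 0.3]),
}
query_code = binarize([0.6, -0.1, 0.8, 0.4])
print(nearest_place(query_code, database))  # prints "place_A"
```

In this toy case the query code differs from place_A in one bit and from place_B in three, so place_A is returned. Comparing short binary codes by Hamming distance is what makes hashing attractive for large-scale retrieval: each comparison is a cheap bit count rather than a floating-point distance.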


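Source journal

Computer Vision and Image Understanding (Engineering and Technology; Engineering: Electrical and Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days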

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: theory; early vision; data structures and representations; shape; range; motion; matching and recognition; architecture and languages; vision systems.