{"title":"Improving LSH via tensorized random projection","authors":"Bhisham Dev Verma, Rameshwar Pratap","doi":"10.1007/s00236-025-00479-x","DOIUrl":null,"url":null,"abstract":"<div><p>Locality-sensitive hashing (LSH) is a fundamental algorithmic toolkit used by data scientists for approximate nearest neighbour search problems that have been used extensively in many large-scale data processing applications such as near-duplicate detection, nearest-neighbour search, clustering, etc. In this work, we aim to propose faster and space-efficient locality-sensitive hash functions for Euclidean distance and cosine similarity for tensor data. Typically, the naive approach for obtaining LSH for tensor data involves first reshaping the tensor into vectors, followed by applying existing LSH methods for vector data. However, this approach becomes impractical for higher-order tensors because the size of the reshaped vector becomes exponential in the order of the tensor. Consequently, the size of LSH’s parameters increases exponentially. To address this problem, we suggest two methods for LSH for Euclidean distance and cosine similarity, namely CP-E2LSH, TT-E2LSH, and CP-SRP, TT-SRP, respectively, building on CP and tensor train (TT) decompositions techniques. Our approaches are space-efficient and can be efficiently applied to low-rank CP or TT tensors. We provide a rigorous theoretical analysis of our proposal on their correctness and efficacy.</p></div>","PeriodicalId":7189,"journal":{"name":"Acta Informatica","volume":"62 1","pages":""},"PeriodicalIF":0.4000,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Acta Informatica","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s00236-025-00479-x","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Abstract
Locality-sensitive hashing (LSH) is a fundamental algorithmic tool for approximate nearest-neighbour search and is used extensively in large-scale data processing applications such as near-duplicate detection, nearest-neighbour search, and clustering. In this work, we propose faster and more space-efficient locality-sensitive hash functions for Euclidean distance and cosine similarity on tensor data. The naive approach to LSH for tensor data is to first reshape the tensor into a vector and then apply existing LSH methods for vector data. However, this approach becomes impractical for higher-order tensors because the size of the reshaped vector, and hence the size of the LSH parameters, grows exponentially with the order of the tensor. To address this problem, we propose two methods each for Euclidean distance and cosine similarity, namely CP-E2LSH and TT-E2LSH, and CP-SRP and TT-SRP, respectively, building on CP and tensor train (TT) decomposition techniques. Our approaches are space-efficient and can be applied efficiently to low-rank CP or TT tensors. We provide a rigorous theoretical analysis establishing their correctness and efficacy.
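The contrast in the abstract can be illustrated with a minimal sketch. The code below is not the paper's CP-E2LSH construction; it shows (a) the naive route of flattening a tensor and applying standard E2LSH, whose projection vector has prod(d_n) entries, and (b) an assumed space-efficient variant in which the projection is a rank-1 (outer-product) tensor, so only sum(d_n) random numbers are stored and the inner product with a CP-format input is computed factor by factor. Function names, the rank-1 projection, and the bucket width `w` are illustrative assumptions; the paper's constructions and their LSH guarantees are more refined.

```python
import numpy as np

def naive_e2lsh_hash(tensor, w=4.0, seed=0):
    """Naive E2LSH on a tensor: flatten, then hash the resulting vector.
    The projection vector has prod(tensor.shape) entries, which grows
    exponentially with the order of the tensor."""
    rng = np.random.default_rng(seed)
    x = tensor.reshape(-1)                      # flatten to a vector
    a = rng.standard_normal(x.size)             # dense Gaussian projection
    b = rng.uniform(0.0, w)                     # random offset in [0, w)
    return int(np.floor((a @ x + b) / w))

def rank1_projection_hash(cp_weights, cp_factors, w=4.0, seed=0):
    """Hypothetical space-efficient variant (illustration only): the
    projection is a rank-1 tensor a_1 x a_2 x ... x a_N, so only
    sum(d_n) random numbers are stored instead of prod(d_n).
    For a CP-format input X = sum_r lambda_r u_r^(1) x ... x u_r^(N),
    <a, X> = sum_r lambda_r * prod_n <a_n, u_r^(n)>."""
    rng = np.random.default_rng(seed)
    dims = [U.shape[0] for U in cp_factors]     # mode sizes d_1, ..., d_N
    a_modes = [rng.standard_normal(d) for d in dims]
    b = rng.uniform(0.0, w)
    # Inner product computed mode by mode, never materialising the full tensor.
    proj = sum(
        lam * np.prod([a_n @ U[:, r] for a_n, U in zip(a_modes, cp_factors)])
        for r, lam in enumerate(cp_weights)
    )
    return int(np.floor((proj + b) / w))

# Toy usage: a rank-2 CP tensor of order 3 with mode sizes 5 x 6 x 7.
rng = np.random.default_rng(1)
weights = np.array([1.0, 0.5])
factors = [rng.standard_normal((d, 2)) for d in (5, 6, 7)]
X = np.einsum('r,ir,jr,kr->ijk', weights, *factors)   # dense tensor for the naive hash
print(naive_e2lsh_hash(X), rank1_projection_hash(weights, factors))
```

The point of the sketch is purely the storage and computation pattern: the structured projection never touches the exponentially large flattened vector, which is the problem the CP- and TT-based constructions are designed to avoid.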
Journal description:
Acta Informatica provides international dissemination of articles on formal methods for the design and analysis of programs, computing systems and information structures, as well as related fields of Theoretical Computer Science such as Automata Theory, Logic in Computer Science, and Algorithmics.
Topics of interest include:
• semantics of programming languages
• models and modeling languages for concurrent, distributed, reactive and mobile systems
• models and modeling languages for timed, hybrid and probabilistic systems
• specification, program analysis and verification
• model checking and theorem proving
• modal, temporal, first- and higher-order logics, and their variants
• constraint logic, SAT/SMT-solving techniques
• theoretical aspects of databases, semi-structured data and finite model theory
• theoretical aspects of artificial intelligence, knowledge representation, description logic
• automata theory, formal languages, term and graph rewriting
• game-based models, synthesis
• type theory, typed calculi
• algebraic, coalgebraic and categorical methods
• formal aspects of performance, dependability and reliability analysis
• foundations of information and network security
• parallel, distributed and randomized algorithms
• design and analysis of algorithms
• foundations of network and communication protocols.