{"title":"Efficient continuous kNN join over dynamic high-dimensional data","authors":"Nimish Ukey, Guangjian Zhang, Zhengyi Yang, Binghao Li, Wei Li, Wenjie Zhang","doi":"10.1007/s11280-023-01204-9","DOIUrl":null,"url":null,"abstract":"Abstract Given a user dataset $$\\varvec{U}$$ <mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"> <mml:mrow> <mml:mi>U</mml:mi> </mml:mrow> </mml:math> and an object dataset $$\\varvec{I}$$ <mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"> <mml:mrow> <mml:mi>I</mml:mi> </mml:mrow> </mml:math> , a kNN join query in high-dimensional space returns the $$\\varvec{k}$$ <mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"> <mml:mrow> <mml:mi>k</mml:mi> </mml:mrow> </mml:math> nearest neighbors of each object in dataset $$\\varvec{U}$$ <mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"> <mml:mrow> <mml:mi>U</mml:mi> </mml:mrow> </mml:math> from the object dataset $$\\varvec{I}$$ <mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"> <mml:mrow> <mml:mi>I</mml:mi> </mml:mrow> </mml:math> . The kNN join is a basic and necessary operation in many applications, such as databases, data mining, computer vision, multi-media, machine learning, recommendation systems, and many more. In the real world, datasets frequently update dynamically as objects are added or removed. In this paper, we propose novel methods of continuous kNN join over dynamic high-dimensional data. We firstly propose the HDR $$^+$$ <mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"> <mml:msup> <mml:mrow /> <mml:mo>+</mml:mo> </mml:msup> </mml:math> Tree, which supports more efficient insertion, deletion, and batch update. Further observed that the existing methods rely on globally correlated datasets for effective dimensionality reduction, we then propose the HDR Forest. It clusters the dataset and constructs multiple HDR Trees to capture local correlations among the data. As a result, our HDR Forest is able to process non-globally correlated datasets efficiently. Two novel optimisations are applied to the proposed HDR Forest, including the precomputation of the PCA states of data items and pruning-based kNN recomputation during item deletion. For the completeness of the work, we also present the proof of computing distances in reduced dimensions of PCA in HDR Tree. Extensive experiments on real-world datasets show that the proposed methods and optimisations outperform the baseline algorithms of naive RkNN join and HDR Tree.","PeriodicalId":49356,"journal":{"name":"World Wide Web-Internet and Web Information Systems","volume":"11 1","pages":"0"},"PeriodicalIF":2.7000,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Wide Web-Internet and Web Information Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s11280-023-01204-9","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Abstract Given a user dataset $$\varvec{U}$$ U and an object dataset $$\varvec{I}$$ I , a kNN join query in high-dimensional space returns the $$\varvec{k}$$ k nearest neighbors of each object in dataset $$\varvec{U}$$ U from the object dataset $$\varvec{I}$$ I . The kNN join is a basic and necessary operation in many applications, such as databases, data mining, computer vision, multi-media, machine learning, recommendation systems, and many more. In the real world, datasets frequently update dynamically as objects are added or removed. In this paper, we propose novel methods of continuous kNN join over dynamic high-dimensional data. We firstly propose the HDR $$^+$$ + Tree, which supports more efficient insertion, deletion, and batch update. Further observed that the existing methods rely on globally correlated datasets for effective dimensionality reduction, we then propose the HDR Forest. It clusters the dataset and constructs multiple HDR Trees to capture local correlations among the data. As a result, our HDR Forest is able to process non-globally correlated datasets efficiently. Two novel optimisations are applied to the proposed HDR Forest, including the precomputation of the PCA states of data items and pruning-based kNN recomputation during item deletion. For the completeness of the work, we also present the proof of computing distances in reduced dimensions of PCA in HDR Tree. Extensive experiments on real-world datasets show that the proposed methods and optimisations outperform the baseline algorithms of naive RkNN join and HDR Tree.
期刊介绍:
World Wide Web: Internet and Web Information Systems (WWW) is an international, archival, peer-reviewed journal which covers all aspects of the World Wide Web, including issues related to architectures, applications, Internet and Web information systems, and communities. The purpose of this journal is to provide an international forum for researchers, professionals, and industrial practitioners to share their rapidly developing knowledge and report on new advances in Internet and web-based systems. The journal also focuses on all database- and information-system topics that relate to the Internet and the Web, particularly on ways to model, design, develop, integrate, and manage these systems.
Appearing quarterly, the journal publishes (1) papers describing original ideas and new results, (2) vision papers, (3) reviews of important techniques in related areas, (4) innovative application papers, and (5) progress reports on major international research projects. Papers published in the WWW journal deal with subjects directly or indirectly related to the World Wide Web. The WWW journal provides timely, in-depth coverage of the most recent developments in the World Wide Web discipline to enable anyone involved to keep up-to-date with this dynamically changing technology.