Fast nearest-neighbor search in disk-resident graphs

Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2010-07-25. DOI: 10.1145/1835804.1835871

Purnamrita Sarkar, A. Moore


Citations: 58

Abstract

Link prediction, personalized graph search, fraud detection, and many similar graph mining problems revolve around computing the k nodes most "similar" to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized PageRank, hitting and commute times, and SimRank. There are two fundamental problems associated with these measures. First, existing online algorithms typically examine the local neighborhood of the query node, and they can slow down significantly whenever high-degree nodes are encountered (a common phenomenon in real-world graphs). We prove that turning high-degree nodes into sinks results in only a small approximation error, while greatly improving running times. The second problem is that of computing similarities at query time when the graph is too large to be memory-resident. The obvious solution is to split the graph into clusters of nodes and store each cluster on a disk page; ideally, random walks will rarely cross cluster boundaries and cause page faults. Our contributions here are twofold: (a) we present an efficient deterministic algorithm to find the k closest neighbors (in terms of personalized PageRank) of any query node in such a clustered graph, and (b) we develop a clustering algorithm (RWDISK) that uses only sequential sweeps over data files. Empirical results on several large publicly available graphs such as DBLP, Citeseer, and LiveJournal (~90M edges) demonstrate that turning high-degree nodes into sinks not only improves the running time of RWDISK by a factor of 3 but also boosts link prediction accuracy by a factor of 4 on average. We also show that RWDISK returns more desirable (high conductance and small size) clusters than the popular clustering algorithm METIS, while requiring much less memory. Finally, our deterministic algorithm for computing nearest neighbors incurs far fewer page faults (by a factor of 5) than actually simulating random walks.
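To make the sink idea concrete, here is a minimal in-memory sketch. It is not the paper's deterministic algorithm or RWDISK. It only approximates personalized PageRank by truncated power iteration and optionally stops propagating probability mass at high-degree nodes. The function names, the restart probability `alpha`, the `max_degree` threshold, and the toy graph are all assumptions made for illustration.

```python
from collections import defaultdict


def ppr_with_sinks(adj, query, alpha=0.15, max_degree=None, steps=50):
    """Approximate personalized PageRank from `query` (illustrative sketch).

    adj: dict mapping each node to a list of its neighbors.
    alpha: restart probability (assumed value, not taken from the paper).
    max_degree: if set, any non-query node with more neighbors than this is
        treated as a sink: it still collects score, but the walk stops there.
    steps: number of power-iteration steps before truncating.
    """
    scores = defaultdict(float)
    frontier = {query: 1.0}  # probability mass of walks that are still alive
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in frontier.items():
            scores[node] += alpha * mass  # the walk restarts here with prob. alpha
            nbrs = adj.get(node, ())
            is_sink = (
                max_degree is not None
                and node != query
                and len(nbrs) > max_degree
            )
            if not nbrs or is_sink:
                continue  # mass reaching a sink or dead end is not propagated
            share = (1 - alpha) * mass / len(nbrs)
            for nb in nbrs:
                nxt[nb] += share
        frontier = nxt
        if not frontier:
            break
    return dict(scores)


def top_k(scores, query, k):
    """Return the k highest-scoring nodes other than the query itself."""
    ranked = sorted(
        ((s, n) for n, s in scores.items() if n != query),
        key=lambda t: (-t[0], t[1]),
    )
    return [n for _, n in ranked[:k]]


# Hypothetical toy graph: "hub" has five neighbors, everything else at most two.
adj = {
    "q": ["a", "hub"],
    "a": ["q", "b"],
    "b": ["a"],
    "hub": ["q", "x1", "x2", "x3", "x4"],
    "x1": ["hub"], "x2": ["hub"], "x3": ["hub"], "x4": ["hub"],
}

if __name__ == "__main__":
    with_sinks = ppr_with_sinks(adj, "q", max_degree=3)
    without_sinks = ppr_with_sinks(adj, "q")
    print("nodes touched with sinks:   ", sorted(with_sinks))
    print("nodes touched without sinks:", sorted(without_sinks))
    print("top-2 with sinks:", top_k(with_sinks, "q", 2))
```

With `max_degree=3`, the walk never expands past `hub`, so its four leaf neighbors are never visited. That is the effect the abstract relies on: fewer nodes examined (and, for a disk-resident graph, fewer pages touched) in exchange for an approximation error that the paper proves is small.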
