Fast nearest-neighbor search in disk-resident graphs

Purnamrita Sarkar, A. Moore
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, published 2010-07-25
DOI: https://doi.org/10.1145/1835804.1835871
Citations: 58

Abstract

Link prediction, personalized graph search, fraud detection, and many similar graph mining problems revolve around computing the k nodes most "similar" to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized PageRank, hitting and commute times, and SimRank. There are two fundamental problems associated with these measures. First, existing online algorithms typically examine the local neighborhood of the query node, which can become significantly slower whenever high-degree nodes are encountered (a common phenomenon in real-world graphs). We prove that turning high-degree nodes into sinks results in only a small approximation error, while greatly improving running times. The second problem is that of computing similarities at query time when the graph is too large to be memory-resident. The obvious solution is to split the graph into clusters of nodes and store each cluster on a disk page; ideally, random walks will rarely cross cluster boundaries and cause page faults. Our contributions here are twofold: (a) we present an efficient deterministic algorithm to find the k closest neighbors (in terms of personalized PageRank) of any query node in such a clustered graph, and (b) we develop a clustering algorithm (RWDISK) that uses only sequential sweeps over data files. Empirical results on several large publicly available graphs such as DBLP, Citeseer, and LiveJournal (~90M edges) demonstrate that turning high-degree nodes into sinks not only improves the running time of RWDISK by a factor of 3 but also boosts link prediction accuracy by a factor of 4 on average. We also show that RWDISK returns more desirable (high conductance and small size) clusters than the popular clustering algorithm METIS, while requiring much less memory. Finally, our deterministic algorithm for computing nearest neighbors incurs far fewer page faults (a factor of 5) than actually simulating random walks.
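To make the "high-degree nodes as sinks" idea concrete, the following is a minimal in-memory sketch of personalized PageRank computed by truncated power iteration, with an optional degree threshold above which a node stops propagating walk mass. It is not the paper's deterministic disk-based algorithm and not RWDISK. The function names, the restart probability `alpha=0.15`, the `max_degree` parameter, the choice to exempt the query node from becoming a sink, and the choice to simply drop the mass that would have left a sink are all assumptions made for illustration.

```python
from collections import defaultdict


def personalized_pagerank(adj, query, alpha=0.15, iters=50, max_degree=None):
    """Approximate personalized PageRank scores from `query`.

    adj: dict mapping node -> list of out-neighbors.
    alpha: restart probability (assumed value, not from the paper).
    max_degree: if set, any node other than the query with more than
        this many out-neighbors is treated as a sink (assumption: the
        walk mass that would leave a sink is dropped).
    """
    scores = defaultdict(float)
    frontier = {query: 1.0}  # mass of walks still alive, keyed by current node
    for _ in range(iters):
        next_frontier = defaultdict(float)
        for u, mass in frontier.items():
            # With probability alpha the walk stops at u and is counted there.
            scores[u] += alpha * mass
            neighbors = adj.get(u, [])
            is_sink = not neighbors or (
                max_degree is not None
                and u != query
                and len(neighbors) > max_degree
            )
            if is_sink:
                continue  # nothing is propagated out of a sink
            share = (1.0 - alpha) * mass / len(neighbors)
            for v in neighbors:
                next_frontier[v] += share
        frontier = next_frontier
        if not frontier:
            break
    return dict(scores)


def top_k(scores, query, k):
    """Return the k highest-scoring nodes, excluding the query itself."""
    ranked = sorted(
        (item for item in scores.items() if item[0] != query),
        key=lambda item: (-item[1], str(item[0])),
    )
    return [node for node, _ in ranked[:k]]


# Toy graph: "h" is a hub with five neighbors; "x1".."x4" hang off the hub.
toy = {
    "q": ["a", "b", "h"],
    "a": ["q", "b"],
    "b": ["q", "a"],
    "h": ["q", "x1", "x2", "x3", "x4"],
    "x1": ["h"], "x2": ["h"], "x3": ["h"], "x4": ["h"],
}
full = personalized_pagerank(toy, "q")
sunk = personalized_pagerank(toy, "q", max_degree=3)
print(top_k(full, "q", 3))
print(top_k(sunk, "q", 3))
```

In this toy graph the hub's leaf neighbors are never expanded once the hub is a sink, so the work per iteration stays confined to the query's immediate neighborhood; under the assumptions above, each node's score can only decrease relative to the run without sinks. How large that loss is on real graphs is the subject of the paper's analysis and is not something this sketch measures.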