Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

Weiren Yu, J. Mccann, Chengyuan Zhang, H. Ferhatosmanoğlu
ACM Transactions on Information Systems (TOIS), volume 34 1, pages 1-45. Journal article, published 2022-01-12. DOI: 10.1145/3495209 (https://doi.org/10.1145/3495209)
Citations: 3

Abstract

SimRank is an attractive link-based similarity measure used in fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, as it fails to accurately obtain the diagonal correction matrix D. Moreover, SimRank has a “connectivity trait” problem: increasing the number of paths between a pair of nodes would decrease its similarity score. The best-known remedy, SimRank++ [1], cannot completely fix this problem, since its score would still be zero if there are no common in-neighbors between two nodes. In this article, we study fast high-quality link-based similarity search on billion-scale graphs.

(1) We first devise a “varied-D” method to accurately compute SimRank in linear memory. We also aggregate duplicate computations, which reduces the time of [24] from quadratic to linear in the number of iterations.
(2) We propose a novel “cosine-based” SimRank model to circumvent the “connectivity trait” problem.
(3) To substantially speed up the partial-pairs “cosine-based” SimRank search on large graphs, we devise an efficient dimensionality reduction algorithm, PSR#, with guaranteed accuracy.
(4) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument in [24] that “if D is replaced by a scaled identity matrix (1-γ)I, their top-K rankings will not be affected much”.
(5) We propose a novel method that can accurately convert from Li et al.'s SimRank $\tilde{S}$ to Jeh and Widom’s SimRank S.
(6) We propose GSR#, a generalisation of our “cosine-based” SimRank model, to quantify pairwise similarities across two distinct graphs, unlike SimRank, which would assess nodes across two graphs as completely dissimilar.

Extensive experiments on various datasets demonstrate the superiority of our proposed approaches in terms of high search quality, computational efficiency, accuracy, and scalability on billion-edge graphs.
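
The “connectivity trait” problem and the gap between the two SimRank forms named in the abstract can be seen on a toy graph. The following is a minimal sketch, not the article's varied-D, PSR#, or GSR# algorithms: it is a naive all-pairs fixed-point iteration that takes quadratic memory. The function name `simrank`, the flag `pin_diagonal`, the decay value 0.6, and the two tiny graphs are assumptions made purely for illustration. With the diagonal pinned to 1 it follows Jeh and Widom's definition; without pinning it follows the Li et al. form in which D is the scaled identity (1 - decay)I.

```python
from itertools import product


def simrank(in_nbrs, decay=0.6, iters=10, pin_diagonal=True):
    """Naive all-pairs SimRank by fixed-point iteration (illustrative only).

    in_nbrs: dict mapping every node to the list of its in-neighbours.
    pin_diagonal=True  -> Jeh and Widom's form, s(a, a) is forced to 1.
    pin_diagonal=False -> Li et al.'s form, S = decay*Q*S*Q^T + (1-decay)*I,
                          where Q averages over in-neighbours.
    """
    nodes = list(in_nbrs)
    base = 1.0 if pin_diagonal else 1.0 - decay
    s = {(a, b): base if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iters):
        nxt = {}
        for a, b in product(nodes, nodes):
            if a == b and pin_diagonal:
                nxt[(a, b)] = 1.0
                continue
            ia, ib = in_nbrs[a], in_nbrs[b]
            avg = 0.0
            if ia and ib:
                # Average similarity over all pairs of in-neighbours.
                total = sum(s[(i, j)] for i in ia for j in ib)
                avg = total / (len(ia) * len(ib))
            # Only the unpinned (Li et al.) form adds (1 - decay) on the diagonal.
            bonus = (1.0 - decay) if (a == b and not pin_diagonal) else 0.0
            nxt[(a, b)] = decay * avg + bonus
        s = nxt
    return s


# Hypothetical toy graphs: a and b share one in-neighbour, then two.
one_path = {"x": [], "a": ["x"], "b": ["x"]}
two_paths = {"x": [], "y": [], "a": ["x", "y"], "b": ["x", "y"]}

jw_one = simrank(one_path)[("a", "b")]
jw_two = simrank(two_paths)[("a", "b")]
li_one = simrank(one_path, pin_diagonal=False)

print("Jeh-Widom, one shared in-neighbour: ", jw_one)
print("Jeh-Widom, two shared in-neighbours:", jw_two)
print("Li et al., s(a, b) and s(a, a):     ", li_one[("a", "b")], li_one[("a", "a")])
```

On these graphs the Jeh-Widom score of (a, b) works out analytically to decay = 0.6 with one shared in-neighbour and decay/2 = 0.3 with two, so adding a second connecting path halves the similarity. In the unpinned form the same pair scores decay(1 - decay) = 0.24 and the self-similarity of a is no longer 1, which shows why the two definitions are not interchangeable without a conversion step.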