高效的基于simmrank的相似性连接

ACM Transactions on Database Systems (TODS) Pub Date : 2017-07-31 DOI:10.1145/3083899

Weiguo Zheng, Lei Zou, Lei Chen, Dongyan Zhao

{"title":"高效的基于simmrank的相似性连接","authors":"Weiguo Zheng, Lei Zou, Lei Chen, Dongyan Zhao","doi":"10.1145/3083899","DOIUrl":null,"url":null,"abstract":"Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, which can benefit friend recommendation in social networks and link prediction, and so on. In this article, we adopt “SimRank” [13] to evaluate the similarity between two vertices in a large graph because of its generality. Note that “Simank” is purely structure dependent, and it does not rely on the domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all vertex pairs satisfying the threshold from two sets of vertices U and V. To reduce the search space, we propose a shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover + vertex pairs), based on which the SimRank score of any vertex pair can be computed easily. To find the h-go cover + vertex pairs, we propose an efficient method without building the vertex-pair graph. Hence, large graphs can be dealt with easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"16 1","pages":"1 - 37"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"Efficient SimRank-Based Similarity Join\",\"authors\":\"Weiguo Zheng, Lei Zou, Lei Chen, Dongyan Zhao\",\"doi\":\"10.1145/3083899\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, which can benefit friend recommendation in social networks and link prediction, and so on. In this article, we adopt “SimRank” [13] to evaluate the similarity between two vertices in a large graph because of its generality. Note that “Simank” is purely structure dependent, and it does not rely on the domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all vertex pairs satisfying the threshold from two sets of vertices U and V. To reduce the search space, we propose a shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover + vertex pairs), based on which the SimRank score of any vertex pair can be computed easily. To find the h-go cover + vertex pairs, we propose an efficient method without building the vertex-pair graph. Hence, large graphs can be dealt with easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.\",\"PeriodicalId\":6983,\"journal\":{\"name\":\"ACM Transactions on Database Systems (TODS)\",\"volume\":\"16 1\",\"pages\":\"1 - 37\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Database Systems (TODS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3083899\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Database Systems (TODS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3083899","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 14

摘要

在许多实际应用程序中，图被广泛用于为复杂数据建模。回答大型图上的顶点连接查询是有意义和有趣的，这可以有利于社交网络中的朋友推荐和链接预测等。在本文中，由于“simmrank”的通用性，我们采用“simmrank”[13]来评估大型图中两个顶点之间的相似性。请注意，“Simank”是纯粹依赖于结构的，它不依赖于领域知识。具体来说，我们定义了一个基于simmrank的连接(SRJ)查询，从两组顶点U和v中找到满足阈值的所有顶点对。为了减少搜索空间，我们提出了一个基于最短路径距离的simmrank分数上界，以修剪没有希望的顶点对。在验证中，我们提出了一种新的索引，称为h-go cover+，以有效地计算任何单个顶点对的simmrank分数。给定一个图G，我们只物化了一小部分顶点对(即h-go覆盖+顶点对)的simmrank分数，在此基础上可以很容易地计算出任何顶点对的simmrank分数。为了求出h-go覆盖+顶点对，我们提出了一种无需构建顶点对图的高效方法。因此，可以很容易地处理大型图。在真实和合成数据集上进行的大量实验证实了我们的解决方案的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Efficient SimRank-Based Similarity Join

Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, which can benefit friend recommendation in social networks and link prediction, and so on. In this article, we adopt “SimRank” [13] to evaluate the similarity between two vertices in a large graph because of its generality. Note that “Simank” is purely structure dependent, and it does not rely on the domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all vertex pairs satisfying the threshold from two sets of vertices U and V. To reduce the search space, we propose a shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover + vertex pairs), based on which the SimRank score of any vertex pair can be computed easily. To find the h-go cover + vertex pairs, we propose an efficient method without building the vertex-pair graph. Hence, large graphs can be dealt with easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Database Systems (TODS)

自引率

0.00%

发文量

期刊最新文献

On Finding Rank Regret Representatives Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration Persistent Summaries Influence Maximization Revisited: Efficient Sampling with Bound Tightened The Space-Efficient Core of Vadalog