{"title":"多尺度矩阵采样与次线性时间PageRank计算","authors":"C. Borgs, Mickey Brautbar, J. Chayes, S. Teng","doi":"10.1080/15427951.2013.802752","DOIUrl":null,"url":null,"abstract":"Abstract A fundamental problem arising in many applications in Web science and social network analysis is the problem of identifying all nodes in a network whose PageRank exceeds a given threshold Δ. In this paper, we study the probabilistic version of the problem whereby given an arbitrary approximation factor c > 1, we are asked to output a set S of nodes such that with high probability, S contains all nodes of PageRank at least Δ, and no node of PageRank smaller than Δ/c. We call this problem SignificantPageRanks. We develop a nearly optimal local algorithm for the problem with time complexity on networks with n nodes, where the tilde hides a polylogarithmic factor. We show that every algorithm for solving this problem must have running time of Ω(n/Δ), rendering our algorithm optimal up to logarithmic factors. Our algorithm has sublinear time complexity for applications including Web crawling and Web search that require efficient identification of nodes whose PageRanks are above a threshold Δ = nδ, for some constant 0 < δ < 1. Our algorithm comes with two main technical contributions. The first is a multiscale sampling scheme for a basic matrix problem that could be of interest on its own. For us, it appears as an abstraction of a subproblem we need to tackle in order to solve the SignificantPageRanks problem, but we hope that this abstraction will be useful in designing fast algorithms for identifying nodes that are significant beyond PageRank measurements. In the abstract matrix problem, it is assumed that one can access an unknown right-stochastic matrix by querying its rows, where the cost of a query and the accuracy of the answers depend on a precision parameter ε. At a cost propositional to 1/ε, the query will return a list of O(1/ε) entries and their indices that provide an ε-precision approximation of the row. Our task is to find a set that contains all columns whose sum is at least Δ and omits every column whose sum is less than Δ/c. Our multiscale sampling scheme solves this problem with cost , while traditional sampling algorithms would take time Θ((n/Δ)2). Our second main technical contribution is a new local algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed in [Jeh and Widom 03, Andersen et al. 06] and is highly efficient, particularly for networks with large in-degrees or out-degrees. Together with our multiscale sampling scheme, we are able to solve the SignificantPageRanks problem optimally.","PeriodicalId":38105,"journal":{"name":"Internet Mathematics","volume":"10 1","pages":"20 - 48"},"PeriodicalIF":0.0000,"publicationDate":"2012-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15427951.2013.802752","citationCount":"31","resultStr":"{\"title\":\"Multiscale Matrix Sampling and Sublinear-Time PageRank Computation\",\"authors\":\"C. Borgs, Mickey Brautbar, J. Chayes, S. Teng\",\"doi\":\"10.1080/15427951.2013.802752\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract A fundamental problem arising in many applications in Web science and social network analysis is the problem of identifying all nodes in a network whose PageRank exceeds a given threshold Δ. 
In this paper, we study the probabilistic version of the problem whereby given an arbitrary approximation factor c > 1, we are asked to output a set S of nodes such that with high probability, S contains all nodes of PageRank at least Δ, and no node of PageRank smaller than Δ/c. We call this problem SignificantPageRanks. We develop a nearly optimal local algorithm for the problem with time complexity on networks with n nodes, where the tilde hides a polylogarithmic factor. We show that every algorithm for solving this problem must have running time of Ω(n/Δ), rendering our algorithm optimal up to logarithmic factors. Our algorithm has sublinear time complexity for applications including Web crawling and Web search that require efficient identification of nodes whose PageRanks are above a threshold Δ = nδ, for some constant 0 < δ < 1. Our algorithm comes with two main technical contributions. The first is a multiscale sampling scheme for a basic matrix problem that could be of interest on its own. For us, it appears as an abstraction of a subproblem we need to tackle in order to solve the SignificantPageRanks problem, but we hope that this abstraction will be useful in designing fast algorithms for identifying nodes that are significant beyond PageRank measurements. In the abstract matrix problem, it is assumed that one can access an unknown right-stochastic matrix by querying its rows, where the cost of a query and the accuracy of the answers depend on a precision parameter ε. At a cost propositional to 1/ε, the query will return a list of O(1/ε) entries and their indices that provide an ε-precision approximation of the row. Our task is to find a set that contains all columns whose sum is at least Δ and omits every column whose sum is less than Δ/c. Our multiscale sampling scheme solves this problem with cost , while traditional sampling algorithms would take time Θ((n/Δ)2). Our second main technical contribution is a new local algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed in [Jeh and Widom 03, Andersen et al. 06] and is highly efficient, particularly for networks with large in-degrees or out-degrees. Together with our multiscale sampling scheme, we are able to solve the SignificantPageRanks problem optimally.\",\"PeriodicalId\":38105,\"journal\":{\"name\":\"Internet Mathematics\",\"volume\":\"10 1\",\"pages\":\"20 - 48\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-02-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1080/15427951.2013.802752\",\"citationCount\":\"31\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Internet Mathematics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1080/15427951.2013.802752\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Mathematics\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Internet Mathematics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/15427951.2013.802752","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Mathematics","Score":null,"Total":0}
Citations: 31
Abstract
A fundamental problem arising in many applications in Web science and social network analysis is that of identifying all nodes in a network whose PageRank exceeds a given threshold Δ. In this paper we study a probabilistic version of the problem: given an arbitrary approximation factor c > 1, we are asked to output a set S of nodes such that, with high probability, S contains every node of PageRank at least Δ and no node of PageRank smaller than Δ/c. We call this problem SignificantPageRanks.

We develop a nearly optimal local algorithm for the problem with time complexity Õ(n/Δ) on networks with n nodes, where the tilde hides a polylogarithmic factor. We also show that any algorithm for this problem must have running time Ω(n/Δ), so our algorithm is optimal up to logarithmic factors. The algorithm has sublinear time complexity for applications, including Web crawling and Web search, that require efficient identification of nodes whose PageRanks are above a threshold Δ = n^δ for some constant 0 < δ < 1.

Our algorithm rests on two main technical contributions. The first is a multiscale sampling scheme for a basic matrix problem that may be of interest in its own right. For us it arises as an abstraction of a subproblem we must solve for SignificantPageRanks, but we hope the abstraction will also prove useful in designing fast algorithms for identifying nodes that are significant under measures other than PageRank. In this abstract matrix problem, an unknown right-stochastic matrix can be accessed only by querying its rows, where the cost of a query and the accuracy of the answer depend on a precision parameter ε: at a cost proportional to 1/ε, a query returns a list of O(1/ε) entries and their indices that provide an ε-precision approximation of the row. The task is to output a set that contains all columns whose sum is at least Δ and omits every column whose sum is less than Δ/c. Our multiscale sampling scheme solves this problem at cost Õ(n/Δ), whereas traditional sampling algorithms would take time Θ((n/Δ)²).

Our second main technical contribution is a new local algorithm for approximating personalized PageRank that is more robust than the earlier ones developed in [Jeh and Widom 03, Andersen et al. 06] and is highly efficient, particularly for networks with large in-degrees or out-degrees. Together with the multiscale sampling scheme, it allows us to solve the SignificantPageRanks problem optimally.
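To make the objects described in the abstract concrete, two minimal Python sketches follow. They are built only from what the abstract states and are not reconstructions of the paper's algorithms; every name and parameter in them (query_row, naive_significant_columns, monte_carlo_ppr, eps, alpha, num_walks, max_len) is a hypothetical introduced here for illustration.

The first sketch models the row-query oracle for a right-stochastic matrix and one plausible reading of the single-scale "traditional sampling" baseline whose cost the abstract puts at Θ((n/Δ)²): rows are sampled at a single fixed precision, so the number of queries and the per-query cost multiply.

```python
import random

def query_row(M, i, eps):
    """Hypothetical stand-in for the row oracle described in the abstract:
    return the entries of row i that are at least eps, as (column, value)
    pairs.  Since each row of a right-stochastic matrix sums to 1, at most
    1/eps entries qualify, and the imagined cost of one call is about 1/eps."""
    return [(j, v) for j, v in enumerate(M[i]) if v >= eps]

def naive_significant_columns(M, n, delta, c):
    """Single-scale baseline, NOT the paper's multiscale scheme: sample
    Theta(n/delta) rows uniformly, each at one fixed precision ~delta/n,
    scale the observed column mass up by n/num_rows, and keep the columns
    whose estimate clears the midpoint between delta and delta/c.
    Total oracle cost ~ (n/delta) * (n/delta) = Theta((n/delta)^2)."""
    num_rows = max(1, int(8 * n / delta))
    eps = delta / (8 * n)
    estimates = {}
    for _ in range(num_rows):
        i = random.randrange(n)
        for j, v in query_row(M, i, eps):
            estimates[j] = estimates.get(j, 0.0) + v
    scale = n / num_rows
    threshold = (delta + delta / c) / 2.0
    return {j for j, s in estimates.items() if scale * s >= threshold}
```

The second sketch shows the quantity targeted by the paper's second contribution, personalized PageRank, via the standard Monte Carlo estimator based on random walks with geometric restarts. The paper's local algorithm is a different and, per the abstract, more robust procedure; this sketch only makes the approximation target explicit.

```python
import random
from collections import Counter

def monte_carlo_ppr(out_neighbors, seed, alpha=0.15, num_walks=100000, max_len=100):
    """Estimate the personalized PageRank vector of `seed` by random walks:
    at every step the walk stops with probability alpha (a restart), otherwise
    it moves to a uniformly random out-neighbor.  The empirical distribution
    of stopping nodes converges to personalized PageRank with reset
    probability alpha; max_len only truncates pathologically long walks."""
    counts = Counter()
    for _ in range(num_walks):
        node = seed
        for _ in range(max_len):
            if random.random() < alpha:
                break                      # walk stops: record the current node
            nbrs = out_neighbors.get(node)
            if not nbrs:
                node = seed                # simplifying convention: a dangling
                                           # node behaves as if it linked back
                                           # to the seed
            else:
                node = random.choice(nbrs)
        counts[node] += 1
    return {v: cnt / float(num_walks) for v, cnt in counts.items()}
```

For example, monte_carlo_ppr({0: [1, 2], 1: [2], 2: [0]}, seed=0) returns a dictionary of estimates over the stopping nodes that sums to 1.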