Pub Date: 2019-01-01 | DOI: 10.4230/LIPICS.ICDT.2019.5
Alejandro Grez, Cristian Riveros, M. Ugarte
Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP languages lack a clear semantics, making them hard to understand and generalize. Moreover, there are no general techniques for evaluating CEP query languages with clear performance guarantees. In this paper we embark on the task of giving a rigorous and efficient framework for CEP. We propose a formal language for specifying complex events, called CEL, that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we discuss how to efficiently process complex events by evaluating CEL formulas with unary filters. We start by studying the syntactic properties of CEL and propose rewriting optimization techniques for simplifying the evaluation of formulas. Then, we introduce a formal computational model for CEP, called complex event automata (CEA), and study how to compile CEL formulas with unary filters into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event, followed by constant-delay enumeration of the results. Finally, we gather the main results of this work to present an efficient and declarative framework for CEP.
Title: A Formal Framework for Complex Event Processing
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 5:1-5:18
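As a toy illustration of the evaluation problem only (this is not the paper's CEL syntax or CEA construction; `match_seq`, the filters, and the skip semantics are illustrative assumptions), the following sketch enumerates all matches of a sequencing pattern with unary filters over a stream:

```python
def match_seq(stream, filters):
    # Enumerate complex events for the pattern f1 ; f2 ; ... ; fk, where
    # each fi is a unary predicate over events and ';' means "followed by"
    # (other events may occur in between, i.e. skip-till-any semantics).
    partial = [()]          # partial runs: tuples of matched positions
    results = []
    for i, event in enumerate(stream):
        extended_runs = []
        for run in partial:
            stage = len(run)
            if stage < len(filters) and filters[stage](event):
                run2 = run + (i,)
                if len(run2) == len(filters):
                    results.append(run2)      # a complete complex event
                else:
                    extended_runs.append(run2)
        partial += extended_runs              # old runs survive (skip)
    return results

stream = [{"type": "T", "val": 41}, {"type": "H", "val": 20},
          {"type": "T", "val": 39}, {"type": "H", "val": 70}]
filters = [lambda e: e["type"] == "T",
           lambda e: e["type"] == "H" and e["val"] > 50]
print(match_seq(stream, filters))  # [(0, 3), (2, 3)]
```

Note that the number of partial runs in this naive sketch can grow exponentially; avoiding exactly that blow-up while keeping constant time per event and constant-delay enumeration of results is what the paper's CEA-based evaluation achieves.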
Pub Date: 2019-01-01 | DOI: 10.4230/LIPIcs.ICDT.2019.14
F. Neven, T. Schwentick, Christopher Spinrath, Brecht Vandevoort
Recently, Ketsman et al. started the investigation of the parallel evaluation of recursive queries in the Massively Parallel Communication (MPC) model. Among other things, it was shown that parallel-correctness and parallel-boundedness for general Datalog programs are undecidable, by a reduction from the undecidable containment problem for Datalog. Furthermore, economic policies were introduced as a means to specify data distribution in a recursive setting. In this paper, we extend the latter framework to account for more general distributed evaluation strategies in terms of communication policies. We then show that the undecidability of parallel-correctness runs deeper: it already holds for fragments of Datalog with a decidable containment problem, e.g., monadic and frontier-guarded Datalog, under relatively simple evaluation strategies. These simple evaluation strategies are defined w.r.t. data-moving distribution constraints. We then investigate restrictions of economic policies that yield decidability. In particular, we show that parallel-correctness is 2EXPTIME-complete for monadic and frontier-guarded Datalog under hash-based economic policies. Next, we consider restrictions of data-moving constraints and show that parallel-correctness and parallel-boundedness are 2EXPTIME-complete for frontier-guarded Datalog. Interestingly, distributed evaluation no longer preserves the usual containment relationships between fragments of Datalog. Indeed, not every monadic Datalog program is equivalent to a frontier-guarded one in the distributed setting. We illustrate the latter by considering two alternative settings, in one of which parallel-correctness is decidable for frontier-guarded Datalog but undecidable for monadic Datalog.
2012 ACM Subject Classification: Theory of computation → Database theory
Title: Parallel-Correctness and Parallel-Boundedness for Datalog Programs
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 14:1-14:19
Pub Date: 2019-01-01 | DOI: 10.4230/LIPIcs.ICDT.2019.3
M. Krötzsch, Maximilian Marx, S. Rudolph
Title: The Power of the Terminating Chase (Invited Talk)
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 3:1-3:17
Pub Date: 2019-01-01 | DOI: 10.4230/LIPIcs.ICDT.2019.9
A. Ganguly, J. Munro, Yakov Nekrich, R. Shah, Sharma V. Thankachan
In this paper, we consider a variant of the color range reporting problem called color reporting with frequencies. Our goal is to pre-process a set of colored points into a data structure, so that given a query range Q, we can report all colors that appear in Q, along with their respective frequencies. In other words, for each reported color, we also output the number of times it occurs in Q. We describe an external-memory data structure that uses O(N(1 + log^2 D / log N)) words and answers one-dimensional queries in O(1 + K/B) I/Os, where N is the total number of points in the data structure, D is the total number of colors in the data structure, K is the number of reported colors, and B is the block size. Next we turn to an approximate version of this problem: report all colors that appear in the query range and, for every reported color, provide a constant-factor approximation of its frequency. We consider color reporting with approximate frequencies in two dimensions. Our data structure uses O(N) space and answers two-dimensional queries in O(log_B N + log* B + K/B) I/Os in the special case when the query range is bounded on two sides. As a corollary, we can also answer one-dimensional approximate queries within the same time and space bounds.
2012 ACM Subject Classification: Theory of computation → Data structures design and analysis
Title: Categorical Range Reporting with Frequencies
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 9:1-9:19
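To make the problem statement concrete, here is a naive in-memory sketch of one-dimensional color reporting with frequencies (class and method names are illustrative; this has none of the external-memory I/O guarantees of the paper's structure):

```python
from bisect import bisect_left, bisect_right
from collections import Counter

class ColorFrequencyIndex:
    # Naive stand-in for the problem: report every color in a
    # one-dimensional query range together with its frequency. A query
    # costs O(log N) comparisons plus time linear in the number of points
    # in the range, with none of the paper's O(1 + K/B) I/O guarantees.
    def __init__(self, points):  # points: [(coordinate, color), ...]
        self.points = sorted(points)
        self.xs = [x for x, _ in self.points]

    def query(self, lo, hi):  # closed range [lo, hi]
        i = bisect_left(self.xs, lo)
        j = bisect_right(self.xs, hi)
        return Counter(c for _, c in self.points[i:j])

idx = ColorFrequencyIndex([(1, "red"), (3, "blue"), (4, "red"), (9, "blue")])
print(idx.query(2, 9))  # Counter({'blue': 2, 'red': 1})
```

The paper's contribution is doing this with external-memory bounds that depend on the number of reported colors K rather than the number of points in the range.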
Pub Date: 2019-01-01 | DOI: 10.4230/LIPICS.ICDT.2019.10
Nirman Kumar, Benjamin Raichel, Stavros Sintos, G. V. Buskirk
In multi-parameter decision making, data is usually modeled as a set of points whose dimension is the number of parameters, and the skyline or Pareto points represent the possible optimal solutions for various optimization problems. The structure and computation of such points have been well studied, particularly in the database community. As the skyline can be quite large in high dimensions, one often seeks a compact summary. In particular, for a given integer parameter k, a subset of k points is desired which best approximates the skyline under some measure. Various measures have been proposed, but they mostly treat the skyline as a discrete object. By viewing the skyline as a continuous geometric hull, we propose a new measure that evaluates the quality of a subset by the Hausdorff distance of its hull to the full hull. We argue that in many ways our measure more naturally captures what it means to approximate the skyline. For our new geometric skyline approximation measure, we provide a plethora of results. Specifically, we provide (1) a near linear time exact algorithm in two dimensions, (2) APX-hardness results for dimensions three and higher, (3) approximation algorithms for related variants of our problem, and (4) a practical and efficient heuristic which uses our geometric insights into the problem, as well as various experimental results to show the efficacy of our approach.
Title: Approximating Distance Measures for the Skyline
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 10:1-10:20
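For reference, the exact 2-D skyline that such a summary approximates can be computed with a simple sweep (a standard textbook approach; the paper's contribution concerns approximating the hull of these points, which this sketch does not do):

```python
def skyline(points):
    # 2-D skyline under maximization: the points not dominated by any
    # other point, where q dominates p if q is >= p in both coordinates
    # and > in at least one. Sort by x descending (ties: y descending),
    # then sweep keeping the best y seen so far.
    pts = sorted(points, key=lambda p: (-p[0], -p[1]))
    result, best_y = [], float("-inf")
    for x, y in pts:
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result

print(skyline([(1, 5), (3, 4), (2, 2), (5, 1), (4, 3)]))
# [(5, 1), (4, 3), (3, 4), (1, 5)]
```

The sweep runs in O(n log n) time; the paper's problem then asks for the k-subset of these points whose hull is closest, in Hausdorff distance, to the hull of the full skyline.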
Pub Date: 2019-01-01 | DOI: 10.4230/LIPIcs.ICDT.2019.2
L. Getoor
Title: The Power of Relational Learning (Invited Talk)
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 2:1-2:1
Pub Date: 2018-12-01 | DOI: 10.46298/lmcs-18(1:5)2022
Batya Kenig, Dan Suciu
Integrity constraints such as functional dependencies (FD) and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes "in the limit". Then, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Finally, we show how some of the results in the paper can be derived using the I-measure theory, which relates between information theoretic measures and set theory. Our results recover, and sometimes extend, previously known results about the implication problem: the implication of MVDs and FDs can be checked by considering only 2-tuple relations.
Title: Integrity Constraints Revisited: From Exact to Approximate Implication
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 18:1-18:20
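A minimal sketch of the information-theoretic view, assuming the usual definition of conditional entropy over a relation's empirical distribution: an FD X → Y holds exactly iff H(Y|X) = 0, and a positive value quantifies how approximately it holds (function names here are illustrative, not the paper's notation):

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, X, Y):
    # Empirical H(Y | X) over a relation given as a list of dicts.
    # The FD X -> Y holds exactly iff H(Y | X) = 0; a small positive
    # value means the dependency holds only approximately.
    n = len(rows)
    xy = Counter((tuple(r[a] for a in X), tuple(r[a] for a in Y)) for r in rows)
    x = Counter(tuple(r[a] for a in X) for r in rows)
    return sum(c / n * log2(x[xk] / c) for (xk, _), c in xy.items())

rows = [{"dept": "A", "mgr": "bob"}, {"dept": "A", "mgr": "bob"},
        {"dept": "B", "mgr": "eve"}]
print(conditional_entropy(rows, ["dept"], ["mgr"]))
# 0.0: the FD dept -> mgr holds exactly on this relation
```

With such a degree of satisfaction in hand, the relaxation question is whether a bound on the antecedents' entropies linearly bounds the consequent's, which is exactly what the paper's factors quantify.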
Pub Date: 2018-08-22 | DOI: 10.4230/LIPIcs.ICDT.2019.15
Grzegorz Gluch, J. Marcinkowski, Piotr Ostropolski-Nalewaja
In our paper [Gluch, Marcinkowski, Ostropolski-Nalewaja, LICS, ACM, 2018] we solved an old problem stated in [Calvanese, De Giacomo, Lenzerini, Vardi, PODS, ACM, 2000], showing that query determinacy is undecidable for Regular Path Queries. Here a strong generalisation of this result is shown, and -- we think -- a very unexpected one. We prove that no regularity is needed: determinacy remains undecidable even for finite unions of conjunctive path queries.
Title: The First Order Truth behind Undecidability of Regular Path Queries Determinacy
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 15:1-15:18
Pub Date: 2018-07-01 | DOI: 10.4230/LIPIcs.ICDT.2020.24
Michael Simpson, Venkatesh Srinivasan, Alex Thomo
In this work, we consider misinformation propagating through a social network and study the problem of its prevention. In this problem, a "bad" campaign starts propagating from a set of seed nodes in the network and we use the notion of a limiting (or "good") campaign to counteract the effect of misinformation. The goal is to identify a set of $k$ users that need to be convinced to adopt the limiting campaign so as to minimize the number of people that adopt the "bad" campaign at the end of both propagation processes. This work presents RPS (Reverse Prevention Sampling), an algorithm that provides a scalable solution to the misinformation mitigation problem. Our theoretical analysis shows that RPS runs in $O((k + l)(n + m)\frac{1}{1 - \gamma} \log n / \epsilon^2)$ expected time and returns a $(1 - 1/e - \epsilon)$-approximate solution with probability at least $1 - n^{-l}$ (where $\gamma$ is a typically small network parameter and $l$ is a confidence parameter). The time complexity of RPS substantially improves upon the previously best-known algorithms, which run in time $\Omega(mnk \cdot \mathrm{POLY}(\epsilon^{-1}))$. We experimentally evaluate RPS on large datasets and show that it outperforms the state-of-the-art solution by several orders of magnitude in terms of running time. This demonstrates that misinformation mitigation can be made practical while still offering strong theoretical guarantees.
Title: Reverse Prevention Sampling for Misinformation Mitigation in Social Networks
Published in: Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory, pages 24:1-24:18
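As a rough illustration of the reverse-sampling idea behind such algorithms (a generic RIS-style sketch under the independent cascade model with a uniform edge probability; this is not the paper's RPS procedure, and all names and parameters are illustrative):

```python
import random

def reverse_reachable_set(nodes, in_edges, p):
    # One random reverse-reachable (RR) set under the independent cascade
    # model: walk backwards from a uniformly chosen node, keeping each
    # incoming edge independently with probability p.
    root = random.choice(nodes)
    seen, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u in in_edges.get(v, []):
            if u not in seen and random.random() < p:
                seen.add(u)
                frontier.append(u)
    return seen

def greedy_seeds(rr_sets, k):
    # Greedy max-coverage over RR sets: a node appearing in many RR sets
    # is likely to influence many nodes in the forward direction.
    seeds, covered = [], set()
    for _ in range(k):
        counts = {}
        for i, s in enumerate(rr_sets):
            if i in covered:
                continue
            for v in s:
                counts[v] = counts.get(v, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)
        seeds.append(best)
        covered |= {i for i, s in enumerate(rr_sets) if best in s}
    return seeds

random.seed(0)
nodes = [0, 1, 2, 3]
in_edges = {1: [0], 2: [0], 3: [1, 2]}  # edges 0->1, 0->2, 1->3, 2->3
rr = [reverse_reachable_set(nodes, in_edges, 0.5) for _ in range(500)]
print(greedy_seeds(rr, 1))  # node 0 can forward-reach every node, so [0]
```

The point of sampling RR sets instead of simulating forward cascades from every candidate seed is exactly the runtime gap the abstract describes: the number of samples needed depends on the desired accuracy, not on trying all $n$ candidates against $k$ seeds.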