Designing a Query Language for RDF. M. Arenas, M. Ugarte. ACM Transactions on Database Systems, 2017. doi:10.1145/3129247

When querying a Resource Description Framework (RDF) graph, a prominent feature is the possibility of extending the answer to a query with optional information. However, the definition of this feature in SPARQL—the standard RDF query language—has raised some important issues. Most notably, the use of this feature increases the complexity of the evaluation problem, and its closed-world semantics conflicts with the underlying open-world semantics of RDF. Many approaches for fixing such problems have been proposed, the most prominent being the introduction of the semantic notion of a weakly monotone SPARQL query. Weakly monotone SPARQL queries have shaped the class of queries that conform to the open-world semantics of RDF. Unfortunately, finding an effective way of restricting SPARQL to the fragment of weakly monotone queries has proven to be an elusive problem. In practice, the most widely adopted fragment for writing SPARQL queries is based on the syntactic notion of well-designedness. This notion has proven to be a good approach for writing SPARQL queries, but its expressive power has yet to be fully understood. The starting point of this article is to understand the relation between well-designed queries and the semantic notion of weak monotonicity. It is known that every well-designed SPARQL query is weakly monotone; as our first contribution, we prove that the converse does not hold, even if an extension of this notion based on the use of disjunction is considered. Given this negative result, we embark on the task of defining syntactic fragments that are weakly monotone and have higher expressive power than the fragment of well-designed queries. To this end, we move to a more general scenario where infinite RDF graphs are also allowed, so that interpolation techniques studied for first-order logic can be applied. With the use of these techniques, we are able to define a new operator for SPARQL that gives rise to a query language with the desired properties (over finite and infinite RDF graphs). Notably, every query in this fragment remains weakly monotone if the semantics is restricted to finite RDF graphs. Moreover, we use this result to provide a simple characterization of the class of monotone CONSTRUCT queries, that is, the class of SPARQL queries that produce RDF graphs as output. Finally, we pinpoint the complexity of the evaluation problem for the query languages identified in the article.
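To recall the syntactic condition (in its standard formulation from the literature): in every subpattern of the form (P1 OPTIONAL P2), each variable of P2 that also occurs elsewhere in the query must already occur in P1. The sketch below contrasts the two cases on a toy graph; the vocabulary is made up and the rdflib package is assumed.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:alice ex:name "Alice" ; ex:email ex:mail1 .
ex:mail1 ex:provider ex:acme .
""", format="turtle")

# Well designed: the optional variable ?e occurs nowhere outside the OPTIONAL.
WELL_DESIGNED = """
PREFIX ex: <http://example.org/>
SELECT ?p ?e WHERE {
  ?p ex:name ?n .
  OPTIONAL { ?p ex:email ?e }
}"""

# Not well designed: ?e occurs in the OPTIONAL part and again outside it,
# but not in the required part (?p ex:name ?n) of that OPTIONAL.
NOT_WELL_DESIGNED = """
PREFIX ex: <http://example.org/>
SELECT ?p ?prov WHERE {
  ?p ex:name ?n .
  OPTIONAL { ?p ex:email ?e }
  ?e ex:provider ?prov .
}"""

for q in (WELL_DESIGNED, NOT_WELL_DESIGNED):
    print([tuple(row) for row in g.query(q)])
```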
{"title":"Designing a Query Language for RDF","authors":"M. Arenas, M. Ugarte","doi":"10.1145/3129247","DOIUrl":"https://doi.org/10.1145/3129247","url":null,"abstract":"When querying an Resource Description Framework (RDF) graph, a prominent feature is the possibility of extending the answer to a query with optional information. However, the definition of this feature in SPARQL—the standard RDF query language—has raised some important issues. Most notably, the use of this feature increases the complexity of the evaluation problem, and its closed-world semantics is in conflict with the underlying open-world semantics of RDF. Many approaches for fixing such problems have been proposed, the most prominent being the introduction of the semantic notion of weakly monotone SPARQL query. Weakly monotone SPARQL queries have shaped the class of queries that conform to the open-world semantics of RDF. Unfortunately, finding an effective way of restricting SPARQL to the fragment of weakly monotone queries has proven to be an elusive problem. In practice, the most widely adopted fragment for writing SPARQL queries is based on the syntactic notion of well designedness. This notion has proven to be a good approach for writing SPARQL queries, but its expressive power has yet to be fully understood. The starting point of this article is to understand the relation between well-designed queries and the semantic notion of weak monotonicity. It is known that every well-designed SPARQL query is weakly monotone; as our first contribution we prove that the converse does not hold, even if an extension of this notion based on the use of disjunction is considered. Given this negative result, we embark on the task of defining syntactic fragments that are weakly monotone and have higher expressive power than the fragment of well-designed queries. To this end, we move to a more general scenario where infinite RDF graphs are also allowed, so interpolation techniques studied for first-order logic can be applied. With the use of these techniques, we are able to define a new operator for SPARQL that gives rise to a query language with the desired properties (over finite and infinite RDF graphs). It should be noticed that every query in this fragment is weakly monotone if we restrict the semantics to finite RDF graphs. Moreover, we use this result to provide a simple characterization of the class of monotone CONSTRUCT queries, that is, the class of SPARQL queries that produce RDF graphs as output. Finally, we pinpoint the complexity of the evaluation problem for the query languages identified in the article.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"144 1","pages":"1 - 46"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86755637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PrivBayes. Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, D. Srivastava, Xiaokui Xiao. ACM Transactions on Database Systems, 2017. doi:10.1145/3134428
Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
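A minimal sketch of the pipeline the abstract describes, under strong simplifying assumptions: the network structure (A → B → C) is fixed by hand rather than learned privately, the attributes are binary, and the Laplace scale is illustrative rather than derived from a sensitivity analysis. PrivBayes itself learns the network privately via a surrogate for mutual information, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 10_000, 1.0
eps_m = eps / 3  # budget split across the three released marginals

# Toy data over three binary attributes; B depends on A, C depends on B.
A = rng.integers(0, 2, n)
B = np.where(rng.random(n) < 0.8, A, 1 - A)
C = np.where(rng.random(n) < 0.8, B, 1 - B)

def priv_dist(counts):
    """Noisy counts to distribution: add Laplace noise, clamp, renormalize."""
    noisy = counts + rng.laplace(0.0, 2.0 / eps_m, counts.shape)
    noisy = np.clip(noisy, 0.0, None)
    return noisy / noisy.sum()

# The set P of low-dimensional marginals, each released with noise.
pA  = priv_dist(np.bincount(A, minlength=2).astype(float))
pAB = priv_dist(np.histogram2d(A, B, bins=2)[0])
pBC = priv_dist(np.histogram2d(B, C, bins=2)[0])

# Conditionals of the (hand-fixed) network A -> B -> C from the noisy joints.
pB_given_A = pAB / pAB.sum(axis=1, keepdims=True)
pC_given_B = pBC / pBC.sum(axis=1, keepdims=True)

# Sample the synthetic dataset along the topological order of the network.
syn_A = rng.choice(2, n, p=pA)
syn_B = np.array([rng.choice(2, p=pB_given_A[a]) for a in syn_A])
syn_C = np.array([rng.choice(2, p=pC_given_B[b]) for b in syn_B])
print(np.mean(syn_A == syn_B), np.mean(syn_B == syn_C))  # correlations survive
```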
{"title":"PrivBayes","authors":"Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, D. Srivastava, Xiaokui Xiao","doi":"10.1145/3134428","DOIUrl":"https://doi.org/10.1145/3134428","url":null,"abstract":"Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"2 1","pages":"1 - 41"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86978669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Expressive Power of Query Languages for Matrices. R. Brijder, Floris Geerts, J. V. D. Bussche, Timmy Weerwag. ACM Transactions on Database Systems, 2017. doi:10.1145/3331445
We investigate the expressive power of MATLANG, a formal language for matrix manipulation based on common matrix operations and linear algebra. The language can be extended with the operation inv for inverting a matrix. In MATLANG + inv, we can compute the transitive closure of directed graphs, whereas we show that this is not possible without inversion. Indeed, we show that the basic language can be simulated in the relational algebra with arithmetic operations, grouping, and summation. We also consider an operation eigen for diagonalizing a matrix. It is defined such that, for each eigenvalue, a set of mutually orthogonal eigenvectors spanning the eigenspace of that eigenvalue is returned. We show that inv can be expressed in MATLANG + eigen. We put forward the open question of whether there are Boolean queries about matrices, or generic queries about graphs, expressible in MATLANG + eigen but not in MATLANG + inv. Finally, the evaluation problem for MATLANG + eigen is shown to be complete for the complexity class ∃R (the existential theory of the reals).
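As a concrete instance of what inversion buys: for c smaller than the reciprocal of the spectral radius of the adjacency matrix A, the Neumann series gives (I - cA)^{-1} = I + cA + c^2 A^2 + ..., all terms nonnegative, so entry (i, j) is positive exactly when j is reachable from i. The numpy sketch below illustrates this folklore construction; it is not claimed to match the paper's exact encoding.

```python
import numpy as np

# Reachability (reflexive-transitive closure) from one matrix inversion.
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)  # edges 0->1->2; vertex 3 isolated

n = A.shape[0]
c = 1.0 / (n + 1)  # spectral radius of a 0/1 matrix is at most n, so this converges
M = np.linalg.inv(np.eye(n) - c * A)
print((M > 1e-12).astype(int))  # row i marks exactly the vertices i can reach
```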
{"title":"On the Expressive Power of Query Languages for Matrices","authors":"R. Brijder, Floris Geerts, J. V. D. Bussche, Timmy Weerwag","doi":"10.1145/3331445","DOIUrl":"https://doi.org/10.1145/3331445","url":null,"abstract":"We investigate the expressive power of MATLANG, a formal language for matrix manipulation based on common matrix operations and linear algebra. The language can be extended with the operation inv for inverting a matrix. In MATLANG + inv, we can compute the transitive closure of directed graphs, whereas we show that this is not possible without inversion. Indeed, we show that the basic language can be simulated in the relational algebra with arithmetic operations, grouping, and summation. We also consider an operation eigen for diagonalizing a matrix. It is defined such that for each eigenvalue a set of mutually orthogonal eigenvectors is returned that span the eigenspace of that eigenvalue. We show that inv can be expressed in MATLANG + eigen. We put forward the open question whether there are Boolean queries about matrices, or generic queries about graphs, expressible in MATLANG + eigen but not in MATLANG + inv. Finally, the evaluation problem for MATLANG + eigen is shown to be complete for the complexity class ∃ R.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"1 1","pages":"1 - 31"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82784322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BonXai. W. Martens, F. Neven, Matthias Niewerth, Thomas Schwentick. ACM Transactions on Database Systems, 2017. doi:10.1145/3105960
While the migration from DTD to XML Schema was driven by a need for increased expressivity and flexibility, XML Schema is also significantly more complex to use and understand. Whereas DTDs are characterized by their simplicity, XML Schema Documents are notoriously difficult. In this article, we introduce the XML specification language BonXai, which incorporates many features of XML Schema but is arguably almost as easy to use as DTDs. In brief, this ease of use is achieved by sacrificing the explicit use of types in favor of simple patterns expressing contexts for elements. The goal of BonXai is not to replace XML Schema but rather to provide a simpler alternative for users who want to go beyond the expressiveness and features of DTD but do not need the explicit use of types. Furthermore, XML Schema processing tools can be used as a back-end for BonXai, since BonXai can be automatically converted into XML Schema. A particularly strong point of BonXai is its solid foundation rooted in a decade of theoretical work around pattern-based schemas. We present a formal model for a core fragment of BonXai and the translation algorithms to and from a core fragment of XML Schema. We prove that BonXai and XML Schema can be converted back and forth on the level of tree languages, and we formally study the size trade-offs between the two languages.
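To illustrate the core idea of pattern-based rules without claiming BonXai's concrete syntax (the rule notation below is schematic and made up): the content model of an element is selected by a pattern over its ancestor path, rather than by an explicitly named type.

```python
import re

# Schematic pattern-based schema (made-up notation, not BonXai syntax):
# each rule maps a pattern over the ancestor path to a content model.
rules = [
    (r"^/store$",         "product*"),
    (r"^/store/product$", "name, price, discount?"),
    (r"/discount$",       "percent | amount"),  # context: any discount element
]

def content_model(path):
    """The allowed content of an element is determined by its context."""
    for pattern, model in rules:
        if re.search(pattern, path):
            return model
    return None

print(content_model("/store/product"))           # name, price, discount?
print(content_model("/store/product/discount"))  # percent | amount
```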
{"title":"BonXai","authors":"W. Martens, F. Neven, Matthias Niewerth, Thomas Schwentick","doi":"10.1145/3105960","DOIUrl":"https://doi.org/10.1145/3105960","url":null,"abstract":"While the migration from DTD to XML Schema was driven by a need for increased expressivity and flexibility, the latter was also significantly more complex to use and understand. Whereas DTDs are characterized by their simplicity, XML Schema Documents are notoriously difficult. In this article, we introduce the XML specification language BonXai, which incorporates many features of XML Schema but is arguably almost as easy to use as DTDs. In brief, the latter is achieved by sacrificing the explicit use of types in favor of simple patterns expressing contexts for elements. The goal of BonXai is not to replace XML Schema but rather to provide a simpler alternative for users who want to go beyond the expressiveness and features of DTD but do not need the explicit use of types. Furthermore, XML Schema processing tools can be used as a back-end for BonXai, since BonXai can be automatically converted into XML Schema. A particularly strong point of BonXai is its solid foundation rooted in a decade of theoretical work around pattern-based schemas. We present a formal model for a core fragment of BonXai and the translation algorithms to and from a core fragment of XML Schema. We prove that BonXai and XML Schema can be converted back-and-forth on the level of tree languages and we formally study the size trade-offs between the two languages.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"120 1","pages":"1 - 42"},"PeriodicalIF":0.0,"publicationDate":"2017-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76658356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query Nesting, Assignment, and Aggregation in SPARQL 1.1. M. Kaminski, Egor V. Kostylev, B. C. Grau. ACM Transactions on Database Systems, 2017. doi:10.1145/3083898

Answering aggregate queries is a key requirement of emerging applications of Semantic Technologies, such as data warehousing, business intelligence, and sensor networks. To fulfil the requirements of such applications, the standardization of SPARQL 1.1 led to the introduction of a wide range of constructs that enable value computation, aggregation, and query nesting. In this article, we provide an in-depth formal analysis of the semantics and expressive power of these new constructs as defined in the SPARQL 1.1 specification, and hence lay the necessary foundations for the development of robust, scalable, and extensible query engines supporting complex numerical and analytics tasks.
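For reference, the three construct families the article analyzes can all be seen in one small query: a subquery (nesting), BIND (assignment), and GROUP BY with an aggregate. The snippet below is a minimal sketch with a made-up vocabulary, assuming the rdflib package for execution.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:s1 ex:reading 10.0, 30.0 .
ex:s2 ex:reading 5.0 .
""", format="turtle")

q = """
PREFIX ex: <http://example.org/>
SELECT ?sensor ?avg ?flag WHERE {
  { SELECT ?sensor (AVG(?v) AS ?avg)      # nesting: an aggregating subquery
    WHERE { ?sensor ex:reading ?v }
    GROUP BY ?sensor }
  BIND (IF(?avg >= 10.0, "high", "low") AS ?flag)   # assignment
}"""
for row in g.query(q):
    print(row.sensor, row.avg, row.flag)
```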
{"title":"Query Nesting, Assignment, and Aggregation in SPARQL 1.1","authors":"M. Kaminski, Egor V. Kostylev, B. C. Grau","doi":"10.1145/3083898","DOIUrl":"https://doi.org/10.1145/3083898","url":null,"abstract":"Answering aggregate queries is a key requirement of emerging applications of Semantic Technologies, such as data warehousing, business intelligence, and sensor networks. To fulfil the requirements of such applications, the standardization of SPARQL 1.1 led to the introduction of a wide range of constructs that enable value computation, aggregation, and query nesting. In this article, we provide an in-depth formal analysis of the semantics and expressive power of these new constructs as defined in the SPARQL 1.1 specification, and hence lay the necessary foundations for the development of robust, scalable, and extensible query engines supporting complex numerical and analytics tasks.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"48 1","pages":"1 - 46"},"PeriodicalIF":0.0,"publicationDate":"2017-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85541104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Hardness and Approximation of Euclidean DBSCAN. Junhao Gan, Yufei Tao. ACM Transactions on Database Systems, 2017. doi:10.1145/3083897

DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has received extensive applications. Its computational hardness is still unsolved to this date. The original KDD'96 paper claimed an algorithm of O(n log n) “average runtime complexity” (where n is the number of data points) without a rigorous proof. In 2013, a genuine O(n log n)-time algorithm was found in 2D space under Euclidean distance. The hardness of dimensionality d ≥ 3 has remained open ever since. This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥ 3, the problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs—ones widely believed to be impossible—could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ-approximate DBSCAN, which returns the same clusters as DBSCAN, unless the clusters are “unstable” (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O(n) expected time regardless of the constant dimensionality d. The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O(n) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O(n log log n) time deterministically, improving the existing O(n log n) bound.
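For orientation, here is a compact textbook DBSCAN (a sketch, not the paper's algorithm). The dominant cost is the neighborhood computation, done below with a naive O(n^2) pairwise scan; the lower bound above rules out near-linear-time exact algorithms for d ≥ 3, barring major breakthroughs.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Textbook DBSCAN. Border points join the cluster of a core neighbor
    but never expand it; points reached by no core point stay noise (-1)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # O(n^2) scan
    labels = np.full(n, -1)  # -1 = unvisited / noise
    cluster = 0
    for s in range(n):
        if labels[s] != -1 or len(nbrs[s]) < min_pts:
            continue  # only an unlabeled core point seeds a new cluster
        labels[s] = cluster
        stack = [s]
        while stack:
            p = stack.pop()
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if len(nbrs[q]) >= min_pts:  # core points keep expanding
                        stack.append(q)
        cluster += 1
    return labels

X = np.vstack([np.random.default_rng(0).normal(m, 0.2, (100, 3)) for m in (0, 3)])
print(np.unique(dbscan(X, eps=0.6, min_pts=5), return_counts=True))
```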
{"title":"On the Hardness and Approximation of Euclidean DBSCAN","authors":"Junhao Gan, Yufei Tao","doi":"10.1145/3083897","DOIUrl":"https://doi.org/10.1145/3083897","url":null,"abstract":"DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has received extensive applications. Its computational hardness is still unsolved to this date. The original KDD‚96 paper claimed an algorithm of O(n log n) ”average runtime complexity„ (where n is the number of data points) without a rigorous proof. In 2013, a genuine O(n log n)-time algorithm was found in 2D space under Euclidean distance. The hardness of dimensionality d ≥3 has remained open ever since. This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥3, the problem requires ω(n 4/3) time to solve, unless very significant breakthroughs—ones widely believed to be impossible—could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ-approximate DBSCAN, which returns the same clusters as DBSCAN, unless the clusters are ”unstable„ (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O(n) expected time regardless of the constant dimensionality d. The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O(n) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O(n log log n) time deterministically, improving the existing O(n log n) bound.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"43 1","pages":"1 - 45"},"PeriodicalIF":0.0,"publicationDate":"2017-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89926980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Inclusion Dependencies on Very Many Tables. Fabian Tschirschnitz, Thorsten Papenbrock, Felix Naumann. ACM Transactions on Database Systems, 2017. doi:10.1145/3105959
Detecting inclusion dependencies, the prerequisite of foreign keys, in relational data is a challenging task. Detecting them among the hundreds of thousands or even millions of tables on the Web is daunting. Still, such inclusion dependencies can help connect disparate pieces of information on the Web and reveal unknown relationships among tables. We present Many, a novel inclusion-dependency detection algorithm specialized for the very many—but typically small—tables found on the Web. We make use of Bloom filters and indexed bit-vectors to show the feasibility of our approach. Our evaluation on two corpora of Web tables shows a superior runtime over known approaches and demonstrates its usefulness for revealing hidden structures on the Web.
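A sketch of the core filtering idea under simple assumptions (the column names and the tiny Bloom filter are made up; the real algorithm's data structures and candidate generation are more involved): a candidate inclusion A ⊆ B can be discarded cheaply whenever A's Bloom-filter bits are not a subset of B's, and only the surviving candidates are verified exactly.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter stored in one integer's bits (illustrative sizes)."""
    def __init__(self, m=512, k=3):
        self.m, self.k, self.bits = m, k, 0

    def add(self, value):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            self.bits |= 1 << (int.from_bytes(h[:8], "big") % self.m)

    def maybe_subset_of(self, other):
        # Necessary condition for set inclusion: no bit set here that is
        # missing in `other`. False positives possible, false negatives not.
        return self.bits & ~other.bits == 0

columns = {  # hypothetical Web-table columns
    "orders.customer_id": ["1", "2", "3"],
    "customers.id": ["1", "2", "3", "4"],
    "products.sku": ["a", "b"],
}
filters = {}
for name, values in columns.items():
    filters[name] = Bloom()
    for v in values:
        filters[name].add(v)

for a in columns:
    for b in columns:
        if a != b and filters[a].maybe_subset_of(filters[b]):  # cheap prune
            if set(columns[a]) <= set(columns[b]):             # exact verify
                print(f"{a} ⊆ {b}")
```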
{"title":"Detecting Inclusion Dependencies on Very Many Tables","authors":"Fabian Tschirschnitz, Thorsten Papenbrock, Felix Naumann","doi":"10.1145/3105959","DOIUrl":"https://doi.org/10.1145/3105959","url":null,"abstract":"Detecting inclusion dependencies, the prerequisite of foreign keys, in relational data is a challenging task. Detecting them among the hundreds of thousands or even millions of tables on the web is daunting. Still, such inclusion dependencies can help connect disparate pieces of information on the Web and reveal unknown relationships among tables. With the algorithm Many, we present a novel inclusion dependency detection algorithm, specialized for the very many—but typically small—tables found on the Web. We make use of Bloom filters and indexed bit-vectors to show the feasibility of our approach. Our evaluation on two corpora of Web tables shows a superior runtime over known approaches and its usefulness to reveal hidden structures on the Web.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"93 1","pages":"1 - 29"},"PeriodicalIF":0.0,"publicationDate":"2017-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73280772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DBSCAN Revisited, Revisited. Erich Schubert, J. Sander, M. Ester, H. Kriegel, Xiaowei Xu. ACM Transactions on Database Systems, 2017. doi:10.1145/3068335
At SIGMOD 2015, an article was presented with the title “DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation” that won the conference’s best paper award. In this technical correspondence, we point out some inaccuracies in the way DBSCAN was represented, and argue that the criticism should have been directed at the assumption about the performance of spatial index structures such as R-trees, not at an algorithm that can use such indexes. We also discuss the relationship between DBSCAN performance and the indexability of the dataset, and present some heuristics for choosing appropriate DBSCAN parameters. We propose some indicators of bad parameters to help future users of this algorithm choose parameters that yield both meaningful results and good performance. In new experiments, we show that the new SIGMOD 2015 methods do not appear to offer practical benefits if the DBSCAN parameters are well chosen, and thus are primarily of theoretical interest. In conclusion, the original DBSCAN algorithm with effective indexes and reasonably chosen parameter values performs competitively compared to the method proposed by Gan and Tao.
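The standard parameter heuristic this line of work builds on is the sorted k-distance plot from the original DBSCAN paper: pick minPts = k + 1, sort every point's distance to its k-th nearest neighbor, and read eps off the “knee” of the curve. A minimal sketch, assuming scikit-learn; the percentile stand-in for visual knee-finding is our simplification:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(3, 0.3, (200, 2))])

k = 4  # the original DBSCAN paper suggests k = 4 for 2D data
# Column k of kneighbors() is the k-th neighbor (column 0 is the point itself).
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kdist = np.sort(dists[:, -1])
eps = kdist[int(0.95 * len(kdist))]  # crude stand-in for eyeballing the knee

labels = DBSCAN(eps=eps, min_samples=k + 1).fit_predict(X)
print(f"eps={eps:.3f} clusters={labels.max() + 1} noise={(labels == -1).sum()}")
```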
{"title":"DBSCAN Revisited, Revisited","authors":"Erich Schubert, J. Sander, M. Ester, H. Kriegel, Xiaowei Xu","doi":"10.1145/3068335","DOIUrl":"https://doi.org/10.1145/3068335","url":null,"abstract":"At SIGMOD 2015, an article was presented with the title “DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation” that won the conference’s best paper award. In this technical correspondence, we want to point out some inaccuracies in the way DBSCAN was represented, and why the criticism should have been directed at the assumption about the performance of spatial index structures such as R-trees and not at an algorithm that can use such indexes. We will also discuss the relationship of DBSCAN performance and the indexability of the dataset, and discuss some heuristics for choosing appropriate DBSCAN parameters. Some indicators of bad parameters will be proposed to help guide future users of this algorithm in choosing parameters such as to obtain both meaningful results and good performance. In new experiments, we show that the new SIGMOD 2015 methods do not appear to offer practical benefits if the DBSCAN parameters are well chosen and thus they are primarily of theoretical interest. In conclusion, the original DBSCAN algorithm with effective indexes and reasonably chosen parameter values performs competitively compared to the method proposed by Gan and Tao.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"17 1","pages":"1 - 21"},"PeriodicalIF":0.0,"publicationDate":"2017-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81710955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient SimRank-Based Similarity Join. Weiguo Zheng, Lei Zou, Lei Chen, Dongyan Zhao. ACM Transactions on Database Systems, 2017. doi:10.1145/3083899

Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, and can benefit applications such as friend recommendation in social networks and link prediction. In this article, we adopt “SimRank” [13] to evaluate the similarity between two vertices in a large graph because of its generality. Note that SimRank is purely structure dependent and does not rely on domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all vertex pairs satisfying the threshold from two sets of vertices U and V. To reduce the search space, we propose a shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification phase, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover+ vertex pairs), based on which the SimRank score of any vertex pair can be computed easily. To find the h-go cover+ vertex pairs, we propose an efficient method that does not build the vertex-pair graph, so large graphs can be handled easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.
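For context, SimRank's recursive definition: s(a, a) = 1 and, for a ≠ b, s(a, b) = (C / (|I(a)|·|I(b)|)) · Σ_{i ∈ I(a), j ∈ I(b)} s(i, j), where I(v) is v's in-neighborhood and C is a decay constant. Below is a naive all-pairs fixed-point iteration; it materializes every vertex pair, which is exactly the cost the paper's h-go cover+ index is designed to avoid.

```python
import numpy as np

def simrank(adj, C=0.6, iters=10):
    """Naive SimRank: iterate the fixed-point equations over all pairs."""
    n = len(adj)
    in_nbrs = [[u for u in range(n) if adj[u][v]] for v in range(n)]
    S = np.eye(n)
    for _ in range(iters):
        T = np.eye(n)
        for a in range(n):
            for b in range(n):
                if a == b or not in_nbrs[a] or not in_nbrs[b]:
                    continue
                total = sum(S[i, j] for i in in_nbrs[a] for j in in_nbrs[b])
                T[a, b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        S = T
    return S

# Vertex 0 points to both 1 and 2, so vertices 1 and 2 become similar.
adj = [[0, 1, 1],
       [0, 0, 0],
       [0, 0, 0]]
S = simrank(adj)
threshold = 0.3  # an SRJ query keeps the pairs at or above the threshold
print([(a, b) for a in range(3) for b in range(a + 1, 3) if S[a, b] >= threshold])
```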
{"title":"Efficient SimRank-Based Similarity Join","authors":"Weiguo Zheng, Lei Zou, Lei Chen, Dongyan Zhao","doi":"10.1145/3083899","DOIUrl":"https://doi.org/10.1145/3083899","url":null,"abstract":"Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, which can benefit friend recommendation in social networks and link prediction, and so on. In this article, we adopt “SimRank” [13] to evaluate the similarity between two vertices in a large graph because of its generality. Note that “Simank” is purely structure dependent, and it does not rely on the domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all vertex pairs satisfying the threshold from two sets of vertices U and V. To reduce the search space, we propose a shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover + vertex pairs), based on which the SimRank score of any vertex pair can be computed easily. To find the h-go cover + vertex pairs, we propose an efficient method without building the vertex-pair graph. Hence, large graphs can be dealt with easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"16 1","pages":"1 - 37"},"PeriodicalIF":0.0,"publicationDate":"2017-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90377495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Consistent Query Answering for Self-Join-Free Conjunctive Queries Under Primary Key Constraints. Paraschos Koutris, J. Wijsen. ACM Transactions on Database Systems, 2017. doi:10.1145/3068334

A relational database is said to be uncertain if primary key constraints can possibly be violated. A repair (or possible world) of an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two distinct tuples with the same primary key value. For any Boolean query q, CERTAINTY(q) is the problem that takes an uncertain database db as input and asks whether q is true in every repair of db. The complexity of this problem has been particularly studied for q ranging over the class of self-join-free Boolean conjunctive queries. A research challenge is to determine, given q, whether CERTAINTY(q) belongs to complexity classes FO, P, or coNP-complete. In this article, we combine existing techniques for studying this complexity classification task. We show that, for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO. We additionally show how to construct a single SQL query for solving CERTAINTY(q) if it is in FO. Further, for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is either in P or coNP-complete and the complexity dichotomy is effective. This settles a research question that has been open for 10 years.
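To make the FO case concrete: when CERTAINTY(q) is in FO, it can be answered by a single first-order (hence SQL) rewriting evaluated directly over the uncertain database. A minimal sketch for the query “keys whose value is 'a'” on a table R(key, val) with intended primary key `key` (table and data made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE R (key INT, val TEXT);  -- `key` is the intended primary key
INSERT INTO R VALUES (1, 'a'), (1, 'b'),  -- key 1 violates the constraint
                     (2, 'a'), (3, 'c');
""")

-- = None  # (ignore) -- placeholder removed
```

```python
# Naive evaluation of q: "keys with val = 'a'".
print(con.execute("SELECT DISTINCT key FROM R WHERE val = 'a'").fetchall())
# [(1,), (2,)] -- but 1 is not certain: the repair keeping (1, 'b') drops it.

# Certain-answer rewriting: a key qualifies only if *every* tuple with that
# key satisfies the condition, i.e., q holds in every repair.
certain = """
SELECT DISTINCT key FROM R AS r1
WHERE NOT EXISTS (SELECT 1 FROM R AS r2
                  WHERE r2.key = r1.key AND r2.val <> 'a')
"""
print(con.execute(certain).fetchall())  # [(2,)]
```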
{"title":"Consistent Query Answering for Self-Join-Free Conjunctive Queries Under Primary Key Constraints","authors":"Paraschos Koutris, J. Wijsen","doi":"10.1145/3068334","DOIUrl":"https://doi.org/10.1145/3068334","url":null,"abstract":"A relational database is said to be uncertain if primary key constraints can possibly be violated. A repair (or possible world) of an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two distinct tuples with the same primary key value. For any Boolean query q, CERTAINTY(q) is the problem that takes an uncertain database db as input and asks whether q is true in every repair of db. The complexity of this problem has been particularly studied for q ranging over the class of self-join-free Boolean conjunctive queries. A research challenge is to determine, given q, whether CERTAINTY(q) belongs to complexity classes FO, P, or coNP-complete. In this article, we combine existing techniques for studying this complexity classification task. We show that, for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO. We additionally show how to construct a single SQL query for solving CERTAINTY(q) if it is in FO. Further, for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is either in P or coNP-complete and the complexity dichotomy is effective. This settles a research question that has been open for 10 years.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"38 1","pages":"1 - 45"},"PeriodicalIF":0.0,"publicationDate":"2017-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79366834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}