A compact representation for efficient uncertain-information integration
Amir Dayyan Borhanian, F. Sadri
doi:10.1145/2513591.2513638, pp. 122-131

The probabilistic relation model has been used for the compact representation of uncertain data in relational databases. In this paper we present the extended probabilistic relation model, a compact representation for uncertain information that admits efficient information integration. We present an algorithm for data integration using this model and prove its correctness. We also explore the complexity of query evaluation under the probabilistic and extended probabilistic models. Finally, we study the problem of obtaining a (pure) probabilistic relation that is equivalent to a given extended probabilistic relation, and present approaches and algorithms for this task. This work is a first and critical step towards practical and efficient uncertain-information integration.
Querying data across different legal domains
Marco Taddeo, Alberto Trombetta, D. Montesi, S. Pierantozzi
doi:10.1145/2513591.2513642, pp. 192-197

The management of legal domains is gaining importance in data management. The geographical distribution of data implied, for example, by cloud-based services requires that legal restrictions and obligations be taken into account whenever data circulates across different legal domains. In this paper, we begin to investigate an approach for coping with the complex issues that arise when dealing with data spanning different legal domains. Our approach consists of a conceptual model that takes into account the notion of legal domain (to be paired with the corresponding data) and a reference architecture for implementing our approach in an actual relational DBMS.
LDBC: benchmarks for graph and RDF data management
P. Boncz
doi:10.1145/2513591.2527070, pp. 1-2

The Linked Data Benchmark Council (LDBC) is an EU project that aims to develop industry-strength benchmarks for graph and RDF data management systems. LDBC introduces so-called "choke-point"-based benchmark development, through which experts identify key technical challenges and introduce them into the benchmark workload; we describe this process in some detail. We also present the status of two LDBC benchmarks currently in development, one targeting graph data management systems using a social-network data case, and the other targeting RDF systems using a data-publishing case.
Big data: a research agenda
A. Cuzzocrea, D. Saccá, J. Ullman
doi:10.1145/2513591.2527071, pp. 198-203

Recently, a great deal of interest in Big Data has arisen, driven mainly by a wide range of research problems strongly related to real-life applications and systems, such as representing, modeling, processing, querying, and mining massive, distributed, large-scale repositories (mostly of an unstructured nature). Inspired by this trend, in this paper we discuss three important aspects of Big Data research, namely OLAP over Big Data, Big Data Posting, and Privacy of Big Data. We also outline future research directions, thereby implicitly defining a research agenda aimed at the leading challenges in this field.
A hybrid page layout integrating PAX and NSM
G. Graefe, Ilia Petrov, Todor Ivanov, Veselin Marinov
doi:10.1145/2513591.2513643, pp. 86-95

The paper explores a hybrid page layout (HPL) that combines the advantages of NSM (the N-ary Storage Model, which stores whole records contiguously) and PAX (Partition Attributes Across, which stores each attribute in its own minipage). The design defines a continuum between NSM and PAX, supporting both efficient scans that minimize cache faults and efficient insertions and updates. Our evaluation shows that HPL fills the PAX-NSM performance gap.
Approximate high-dimensional nearest neighbor queries using R-forests
Michael Nolen, King-Ip Lin
doi:10.1145/2513591.2513652, pp. 48-57

Highly efficient query processing on high-dimensional data, while important, remains a challenge: the curse of dimensionality makes exact solutions very expensive. On the other hand, it has been suggested that quickly returning an answer that is close enough can be preferable. In this paper we introduce the R-Forest, a set of disjoint R-trees built over the domain of the search space. Each R-tree stores a subset of the points in a non-overlapping region of space, a property maintained throughout the life of the forest. The structure adds several new features: a median point used for ordering and searching, a pruning parameter, and restricted access. Combined, these features answer approximate nearest-neighbor queries with better results than alternative methods, such as the locality-sensitive-hashing B-tree (LSB-tree), for the same amount of I/O. Our approach handles different data distributions (even exploiting the distribution, without additional parameter tuning), scales with increasing dimensionality, and, most importantly, gives the user feedback in the form of a lower bound on the quality of the results.
Efficiency and precision trade-offs in graph summary algorithms
S. Campinas, Renaud Delbru, G. Tummarello
doi:10.1145/2513591.2513654, pp. 38-47

In many applications, it is convenient to substitute a large data graph with a smaller homomorphic graph. This paper investigates approaches for summarising massive data graphs. In general, massive data graphs are processed using a shared-nothing infrastructure such as MapReduce. However, accurate graph summarisation algorithms are suboptimal for this kind of environment, as they require multiple iterations over the data graph. We investigate approximate graph summarisation algorithms that are efficient to compute in a shared-nothing infrastructure. We define a model for assessing the quality of a summary with respect to a gold-standard summary. We evaluate the trade-offs between efficiency and precision of the algorithms over several datasets. In an application setting, experiments highlight the need to trade off the precision and volume of a graph summary against the complexity of the summarisation technique.
Personalized progressive filtering of skyline queries in high dimensional spaces
Yann Loyer, Isma Sadoun, K. Zeitouni
doi:10.1145/2513591.2513646, pp. 186-191

Skyline queries were introduced to formulate multi-criteria searches. Such a query selects from a relation the tuples that optimize all the criteria, called dominant tuples. There rarely exists a single dominant tuple; usually there is a set of incomparable ones, the skyline set. Unfortunately, the query deteriorates as the number of criteria increases: the size of its answer grows proportionally. To address this limitation, we propose a flexible approach to categorize and refine the skyline set by applying successive relaxations of the dominance conditions with respect to the user's preferences. Our approach, called θ-skyline, is based on decision theory, which deals with decision-making in the presence of conflicting choices. We also define a global ranking method over the skyline set.
Sequential pattern mining from trajectory data
E. Masciari, Barzan Mozafari
doi:10.1145/2513591.2513653, pp. 162-167

In this paper, we study the problem of mining frequent trajectories, which is crucial in many application scenarios, such as vehicle traffic management, hand-off in cellular networks, and supply chain management. We approach this problem as one of mining frequent sequential patterns. Our approach applies a partitioning strategy to incoming streams of trajectories in order to reduce the trajectory size and represent trajectories as strings. We mine frequent trajectories using a sliding-window approach combined with a counting algorithm that allows us to promptly update the frequency of patterns. To make counting truly efficient, we represent frequent trajectories by prime numbers, whereby the Chinese remainder theorem can be used to expedite the computation.
Top-k join queries: overcoming the curse of anti-correlation
Manish Patil, R. Shah, Sharma V. Thankachan
doi:10.1145/2513591.2513645, pp. 76-85

Existing heuristics for top-k join queries, which aim to minimize scan depth, rely heavily on scores and on the correlation between scores. It is known that for uniformly random scores between two relations of length n, a scan depth of √(kn) is required. Moreover, optimizing multiple selection criteria that are anti-correlated may require a scan depth of up to (n + k)/2. We build a linear-space index which, in anticipation of worst-case queries, maintains a subset of answers. Based on this, we achieve Õ(√(kn)) join trials, i.e., average-case performance even for worst-case queries. The experimental evaluation shows superior performance against the well-known Rank-Join algorithm.