
Proceedings of the 28th International Conference on Scientific and Statistical Database Management: Latest Publications

Pruning Forests to Find the Trees
H. Jamil
The vast majority of phylogenetic databases do not support a declarative query platform through which their contents can be flexibly and conveniently accessed. The template-based query interfaces they do support do not allow arbitrary speculative queries. While a small number of graph query languages such as XQuery, Cypher and GraphQL exist for computer-savvy users, most are too general and complex to be useful for biologists, and too inefficient for large-phylogeny querying. In this paper, we discuss a recently introduced visual query language, called PhyQL, that leverages phylogeny-specific properties to support essential and powerful constructs for a large class of phylogenetic queries. Its deductive-reasoner-based implementation offers opportunities for a wide range of pruning strategies that speed up processing through query-specific optimization, making it suitable for large-phylogeny querying. A hybrid optimization technique that exploits a set of indices and "graphlet" partitioning is discussed. A "fail soonest" strategy is used to avoid hopeless processing and is shown to pay dividends.
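The abstract does not give details of the "fail soonest" strategy, but the general idea of abandoning hopeless candidates as early as possible can be sketched as follows. All names here (`build_leaf_index`, `candidate_trees`, the taxa) are hypothetical illustrations, not PhyQL's actual implementation: a precomputed leaf index lets a query reject every tree missing any query taxon before any expensive tree-pattern matching runs.

```python
# Hypothetical "fail soonest" sketch: before expensive pattern matching,
# use a leaf-name index to discard every tree that cannot possibly match.

def build_leaf_index(trees):
    """Map each taxon name to the set of tree ids containing it."""
    index = {}
    for tree_id, leaves in trees.items():
        for leaf in leaves:
            index.setdefault(leaf, set()).add(tree_id)
    return index

def candidate_trees(index, query_taxa, all_tree_ids):
    """Trees containing every query taxon; all others fail immediately."""
    surviving = set(all_tree_ids)
    for taxon in query_taxa:
        surviving &= index.get(taxon, set())
        if not surviving:          # fail soonest: stop as soon as empty
            break
    return surviving

trees = {1: {"human", "chimp", "mouse"}, 2: {"human", "mouse"}, 3: {"frog", "fish"}}
index = build_leaf_index(trees)
survivors = candidate_trees(index, {"human", "chimp"}, trees.keys())
```

Only the surviving trees would then be handed to the full (and costly) query evaluator.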
Citations: 1
SolveDB: Integrating Optimization Problem Solvers Into SQL Databases
Laurynas Siksnys, T. Pedersen
Many real-world decision problems involve solving optimization problems based on data in an SQL database. Traditionally, solving such problems requires combining a DBMS with optimization software packages for each required class of problems (e.g. linear and constraint programming), leading to workflows that are cumbersome, complex, inefficient, and error-prone. In this paper, we present SolveDB, a DBMS for optimization applications. SolveDB supports solvers for different problem classes and offers seamless data management and optimization problem solving in a pure SQL-based setting. This allows for much simpler and more effective solutions to database-based optimization problems. SolveDB is based on the 3-level ANSI/SPARC architecture and allows formulating, solving, and analysing solutions of optimization problems using a single so-called solve query. SolveDB provides (1) an SQL-based syntax for optimization problems, (2) an extensible infrastructure for integrating different solvers, and (3) query optimization techniques to achieve the best execution performance and/or result quality. Extensive experiments with the PostgreSQL-based implementation show that SolveDB is a versatile tool offering much higher developer productivity and an order of magnitude better performance for specification-complex and data-intensive problems.
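To make the class of problems concrete: a solve query expresses a per-row decision over table data, e.g. choose rows maximizing total value under a capacity constraint. The sketch below is an illustrative stand-in, not SolveDB's syntax or solver infrastructure; the table contents and names are invented, and brute force replaces a real solver.

```python
# Illustrative stand-in (NOT SolveDB syntax): the kind of 0/1 per-row
# decision problem a solve query expresses, solved here by brute force.
from itertools import product

rows = [  # (name, cost, value); hypothetical table contents
    ("a", 4, 10),
    ("b", 3, 7),
    ("c", 5, 11),
]
capacity = 8

best_value, best_pick = -1, ()
for pick in product((0, 1), repeat=len(rows)):
    cost = sum(r[1] for r, p in zip(rows, pick) if p)
    value = sum(r[2] for r, p in zip(rows, pick) if p)
    if cost <= capacity and value > best_value:
        best_value, best_pick = value, pick
```

In a SolveDB-style setting, the point is that this selection logic lives inside the database as a single declarative query, with the system delegating to an appropriate solver (here, an LP/MIP solver) instead of hand-rolled enumeration.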
Citations: 14
Efficient Similarity Search across Top-k Lists under the Kendall's Tau Distance
K. Pal, S. Michel
We consider the problem of similarity search in a set of top-k lists under the generalized Kendall's Tau distance. This distance describes how related two rankings are in terms of discordantly ordered items. We consider pair- and triplet-based indices to counter the shortcomings of naive inverted indices, and derive efficient query schemes by relating the proposed index structures to the concept of locality-sensitive hashing (LSH). Specifically, we devise four different LSH schemes for Kendall's Tau using two generic hash families over individual elements or pairs of them. We show that each of these functions has the desired property of being locality sensitive. Further, we discuss the selection of hash functions for the proposed LSH schemes for a given query ranking, called query-driven LSH, and derive bounds on the number of hash functions required to achieve a predefined recall goal. Experimental results, using two real-world datasets, show that the devised methods outperform the SimJoin method, the state-of-the-art method for querying similar sets, and are far superior to a plain inverted-index-based approach.
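For readers unfamiliar with the distance itself: the (unnormalized) Kendall's Tau distance between two rankings of the same items is the number of item pairs the rankings order discordantly. The minimal sketch below covers only this base case; the generalized top-k variant, which additionally penalizes items appearing in only one of the two lists, is omitted.

```python
# Unnormalized Kendall's Tau distance: count discordantly ordered pairs.
# (The generalized top-k variant with a penalty parameter is omitted.)
from itertools import combinations

def kendall_tau_distance(r1, r2):
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    discordant = 0
    for a, b in combinations(r1, 2):
        # a pair is discordant if the two rankings disagree on its order
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
            discordant += 1
    return discordant

d = kendall_tau_distance(["a", "b", "c", "d"], ["a", "c", "b", "d"])
```

Here only the pair (b, c) is swapped, so the distance is 1; for a fully reversed ranking of n items the distance is n(n-1)/2, the maximum.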
Citations: 4
Graph-based modelling of query sets for differential privacy
Ali Inan, M. E. Gursoy, Emir Esmerdag, Y. Saygin
Differential privacy has gained attention from the community as a mechanism for privacy protection. Significant effort has focused on its application to data analysis, where statistical queries are submitted in batch and answers to these queries are perturbed with noise. The magnitude of this noise depends on the privacy parameter ϵ and the sensitivity of the query set. However, computing the sensitivity is known to be NP-hard. In this study, we propose a method that approximates the sensitivity of a query set. Our solution builds a query-region-intersection graph. We prove that computing the maximum clique size of this graph is equivalent to bounding the sensitivity from above. Our bounds, to the best of our knowledge, are the tightest known in the literature. Our solution currently supports a limited but expressive subset of SQL queries (i.e., range queries), and almost all popular aggregate functions directly (except AVERAGE). Experimental results show the efficiency of our approach: even for large query sets (e.g., more than 2K queries over 5 attributes), by utilizing a state-of-the-art solution for the maximum clique problem, we can approximate sensitivity in under a minute.
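The clique intuition can be seen in a deliberately simplified setting. Assume one-dimensional range count queries (an assumption of this sketch, not the paper's general model): a single record changes the answer of every query whose range contains it, and all such queries pairwise intersect, so the maximum clique of the range-intersection graph upper-bounds the sensitivity. The brute-force clique search below is only viable for tiny inputs; the paper relies on a state-of-the-art maximum clique solver.

```python
# Hedged 1-D sketch: max clique of the range-intersection graph as an
# upper bound on the sensitivity of a batch of range count queries.
from itertools import combinations

ranges = [(0, 10), (5, 15), (8, 20), (30, 40)]  # hypothetical range queries

def intersects(r, s):
    return r[0] <= s[1] and s[0] <= r[1]

def max_clique_size(ranges):
    n = len(ranges)
    for size in range(n, 0, -1):                 # try largest cliques first
        for combo in combinations(range(n), size):
            if all(intersects(ranges[i], ranges[j])
                   for i, j in combinations(combo, 2)):
                return size
    return 0

sensitivity_bound = max_clique_size(ranges)
```

The first three ranges share the point 8, forming a clique of size 3; adding or removing one record at that point changes three answers at once, so the noise must be calibrated to sensitivity 3, not 4.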
Citations: 3
Fast, Explainable View Detection to Characterize Exploration Queries
Thibault Sellam, M. Kersten
The aim of data exploration is to get acquainted with an unfamiliar database. Typically, explorers operate by trial and error: they submit a query, study the result, and refine their query subsequently. In this paper, we investigate how to help them understand their query results. In particular, we focus on medium- to high-dimensional spaces: if the database contains dozens or hundreds of columns, which variables should they inspect? We propose to detect subspaces in which the users' selection differs from the rest of the database. From this idea, we built Ziggy, a tuple description engine. Ziggy can detect informative subspaces, and it can explain why it recommends them, with visualizations and natural language. It can cope with mixed data and missing values, and it penalizes redundancy. Our experiments reveal that it is up to an order of magnitude faster than state-of-the-art feature selection algorithms, at minimal accuracy cost.
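A toy version of "subspaces where the selection differs" can be sketched by scoring each column on how far the selected rows' mean deviates from the rest. This is a stand-in of my own, not Ziggy's actual scoring function, and the table contents are invented.

```python
# Toy sketch (not Ziggy's actual criterion): rank columns by how far the
# user's selection deviates from the rest of the data.

data = {  # hypothetical table, column -> values for rows 0..5
    "x": [1.0, 1.1, 0.9, 5.0, 5.2, 4.8],
    "y": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
selection = {3, 4, 5}  # row ids matched by the user's query

def deviation(values, selection):
    sel = [v for i, v in enumerate(values) if i in selection]
    rest = [v for i, v in enumerate(values) if i not in selection]
    return abs(sum(sel) / len(sel) - sum(rest) / len(rest))

scores = {col: deviation(vals, selection) for col, vals in data.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Column "x" separates the selection cleanly while "y" does not, so a Ziggy-style explainer would steer the analyst toward inspecting "x" first.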
Citations: 10
Geometric Graph Indexing for Similarity Search in Scientific Databases
Ayser Armiti, Michael Gertz
Searching a database for similar graphs is a critical task in many scientific applications, such as drug discovery, geoinformatics, or pattern recognition. Typically, graph edit distance is used to estimate the similarity of non-identical graphs, which is a very hard task. Several indexing structures and lower-bound distances have been proposed to prune the search space. Most of them utilize the number of edit operations and assume graphs with a discrete label alphabet that has a certain canonical order. Unfortunately, such assumptions cannot be guaranteed for geometric graphs, where vertices have coordinates in some two-dimensional space. In this paper, we study similarity range queries for geometric graphs with edit distance constraints. First, we propose an efficient index structure to discover similar vertices. For this, we embed the vertices of different graphs in a higher-dimensional space and index them using the well-known R-tree. Second, we propose three lower-bound distances, with different pruning power and complexity, to filter non-similar graphs. Using representative geometric graphs extracted from a variety of application domains, namely chemoinformatics, character recognition, and image analysis, our framework achieved on average a pruning performance of 94% with a 77% reduction in response time.
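The filtering principle behind lower-bound distances can be illustrated with a much weaker bound than the three the paper proposes: under unit edit costs, two graphs differing in vertex or edge counts need at least that many insertions or deletions, so the size difference already lower-bounds the edit distance. Everything below is an invented illustration of the general filter-then-verify pattern, not the paper's bounds.

```python
# Deliberately simple filter (far weaker than the paper's bounds): under
# unit edit costs, |#V1-#V2| + |#E1-#E2| lower-bounds graph edit distance.

def size_lower_bound(g1, g2):
    (n1, m1), (n2, m2) = g1, g2      # graphs as (num_vertices, num_edges)
    return abs(n1 - n2) + abs(m1 - m2)

def filter_candidates(query, graphs, threshold):
    """Keep only graphs whose lower bound does not exceed the threshold;
    only these survivors need the expensive edit-distance verification."""
    return [gid for gid, g in graphs.items()
            if size_lower_bound(query, g) <= threshold]

graphs = {"g1": (5, 4), "g2": (5, 6), "g3": (12, 20)}
candidates = filter_candidates((5, 5), graphs, threshold=2)
```

A range query with threshold 2 discards "g3" without ever computing its true edit distance; tighter bounds (like the paper's) discard more candidates at higher filtering cost.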
Citations: 1
Bermuda: An Efficient MapReduce Triangle Listing Algorithm for Web-Scale Graphs
Dongqing Xiao, M. Eltabakh, Xiangnan Kong
Triangle listing plays an important role in graph analysis and has numerous graph mining applications. With the rapid growth of graph data, distributed methods for listing triangles over massive graphs are urgently needed. Therefore, the triangle listing problem has been studied in several distributed infrastructures, including MapReduce. However, existing algorithms suffer from generating and shuffling huge amounts of intermediate data, where, interestingly, a large percentage of this data is redundant. Inspired by this observation, we present the "Bermuda" method, an efficient MapReduce-based triangle listing technique for massive graphs. Different from existing approaches, Bermuda effectively reduces the size of the intermediate data via redundancy elimination and sharing of messages whenever possible. As a result, Bermuda achieves orders of magnitude of speedup and enables processing larger graphs that other techniques fail to process under the same resources. Bermuda exploits the locality of processing, i.e., the reduce instance in which each graph vertex will be processed, to avoid the redundancy of generating messages from mappers to reducers. Bermuda also proposes novel message sharing techniques within each reduce instance to increase the usability of the received messages. We present and analyze several reduce-side caching strategies that dynamically learn the expected access patterns of the shared messages and adaptively deploy the appropriate technique for better sharing. Extensive experiments conducted on real-world large-scale graphs show that Bermuda speeds up triangle listing computations by factors of up to 10x. Moreover, with a relatively small cluster, Bermuda can scale up to large datasets, e.g., the ClueWeb graph dataset (688GB), while other techniques fail to finish.
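The underlying computation that Bermuda distributes can be sketched on a single machine: order the vertices, keep for each vertex only its larger-id neighbours, and intersect adjacency sets so each triangle is listed exactly once. This is the standard sequential formulation, not Bermuda's MapReduce decomposition or its message-sharing machinery.

```python
# Single-machine sketch of triangle listing: direct each edge from the
# smaller to the larger vertex id, then intersect adjacency sets so each
# triangle (u, v, w) with u < v < w is emitted exactly once.

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

adj = {}
for u, v in edges:
    lo, hi = min(u, v), max(u, v)
    adj.setdefault(lo, set()).add(hi)   # edge stored only at smaller endpoint
    adj.setdefault(hi, set())

triangles = sorted(
    (u, v, w)
    for u in adj
    for v in adj[u]
    for w in adj[u] & adj[v]            # w > v > u by construction
)
```

In the MapReduce setting, computing the intersections requires shipping adjacency lists to reducers; Bermuda's contribution is sharing those lists across the triangles of one reduce instance instead of re-sending them per triangle.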
Citations: 1
PIEJoin: Towards Parallel Set Containment Joins
Anja Kunkel, Astrid Rheinländer, C. Schiefer, S. Helmer, Panagiotis Bouros, U. Leser
The efficient computation of set containment joins (SCJ) over set-valued attributes is a well-studied problem with many applications in commercial and scientific fields. Nevertheless, a number of open questions remain: an extensive comparative evaluation is still missing, the two most recent algorithms have not yet been compared to each other, and the exact impact of item sort order and of properties of the data on algorithm performance is still largely unknown. Furthermore, all previous works only considered sequential join algorithms, although modern servers offer ample opportunities for parallelization. We present PIEJoin, a novel algorithm for computing SCJ based on intersecting prefix trees built at runtime over the to-be-joined attributes. We also present a highly optimized implementation of PIEJoin which uses tree signatures to save space and interval labeling to improve the runtime of the basic method. Most importantly, PIEJoin can be parallelized easily by partitioning the tree intersection. A comprehensive evaluation on eight data sets shows that PIEJoin in its sequential form already clearly outperforms two of the three most important competitors (PRETTI and PRETTI+). It is mostly, yet not always, slower than the third, LIMIT+(opj), but requires significantly less space. The parallel version of PIEJoin we present here achieves significant further speed-ups, yet our evaluation also shows that further research is needed, as finding the best way of partitioning the join turns out to be non-trivial.
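For context, a set containment join R ⋈⊆ S returns every pair (r, s) with r ⊆ s. The sketch below is the inverted-index baseline in the PRETTI style, i.e. the kind of approach PIEJoin competes with, not PIEJoin's prefix-tree intersection itself; the relations are invented toy data.

```python
# PRETTI-style baseline for a set containment join (r subset-of s):
# intersect the inverted lists of r's elements over the sets in S.

R = {"r1": {1, 2}, "r2": {2, 5}}
S = {"s1": {1, 2, 3}, "s2": {2, 3, 4}, "s3": {1, 2, 5}}

inverted = {}                       # element -> ids of S-sets containing it
for sid, s in S.items():
    for e in s:
        inverted.setdefault(e, set()).add(sid)

result = set()
for rid, r in R.items():
    lists = [inverted.get(e, set()) for e in r]
    # an S-set contains r exactly when it appears on every list
    for sid in set.intersection(*lists) if lists else set():
        result.add((rid, sid))
```

PIEJoin replaces these repeated per-tuple list intersections with a single intersection of two prefix trees built over R and S, which is also what makes partition-based parallelization natural.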
Citations: 11
Monitoring Spatial Coverage of Trending Topics in Twitter
Kostas Patroumpas, M. Loukadakis
Most messages posted in Twitter usually discuss an ongoing event, triggering a series of tweets that together may constitute a trending topic (e.g., #election2012, #jesuischarlie, #oscars2016). Sometimes, such a topic may be trending only locally, assuming that related posts have a geographical reference, either by geotagging them directly with exact coordinates or indirectly by mentioning a well-known landmark (e.g., #bataclan). In this paper, we study how trending topics evolve both in space and time, by monitoring the Twitter stream and detecting online the varying spatial coverage of related geotagged posts across time. Observing the evolving spatial coverage of such posts may reveal the intensity of a phenomenon and its impact on local communities, and can further assist in improving user awareness of facts and situations with a strong local footprint. We propose a technique that can maintain trending topics and readily recognize their locality by subdividing the area of interest into elementary cells. Thus, instead of costly spatial clustering of incoming messages by topic, we can approximately, but almost instantly, identify areas of coverage as groups of contiguous cells, as well as their mutability over time. We conducted a comprehensive empirical study to evaluate the performance of the proposed methodology, as well as the quality of the detected areas of coverage.
{"title":"Monitoring Spatial Coverage of Trending Topics in Twitter","authors":"Kostas Patroumpas, M. Loukadakis","doi":"10.1145/2949689.2949716","DOIUrl":"https://doi.org/10.1145/2949689.2949716","url":null,"abstract":"Most messages posted in Twitter usually discuss an ongoing event, triggering a series of tweets that together may constitute a trending topic (e.g., #election2012, #jesuischarlie, #oscars2016). Sometimes, such a topic may be trending only locally, assuming that related posts have a geographical reference, either directly geotagging them with exact coordinates or indirectly by mentioning a well-known landmark (e.g., #bataclan). In this paper, we study how trending topics evolve both in space and time, by monitoring the Twitter stream and detecting online the varying spatial coverage of related geotagged posts across time. Observing the evolving spatial coverage of such posts may reveal the intensity of a phenomenon and its impact on local communities, and can further assist in improving user awareness on facts and situations with strong local footprint. We propose a technique that can maintain trending topics and readily recognize their locality by subdividing the area of interest into elementary cells. Thus, instead of costly spatial clustering of incoming messages by topic, we can approximately, but almost instantly, identify such areas of coverage as groups of contiguous cells, as well as their mutability with time. We conducted a comprehensive empirical study to evaluate the performance of the proposed methodology, as well as the quality of detected areas of coverage. 
Results confirm that our technique can efficiently cope with scalable volumes of messages, offering incremental response in real-time regarding coverage updates for trending topics.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134573761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
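The cell-based idea in the abstract — replacing spatial clustering with a fixed grid and tracking which cells a topic's posts fall into inside a time window — can be sketched as follows. This is a toy version under our own assumptions (an arbitrary 0.5° grid, a brute-force window scan rather than the authors' incremental maintenance); all names and parameters are ours.

```python
from collections import defaultdict

CELL_DEG = 0.5  # grid resolution in degrees; an illustrative choice

def cell_of(lat, lon, size=CELL_DEG):
    """Snap a coordinate to its elementary grid cell."""
    return (int(lat // size), int(lon // size))

class TopicCoverage:
    """Track, per topic, the set of grid cells covered by geotagged
    posts that fall inside a sliding time window."""

    def __init__(self, window):
        self.window = window
        self.events = []  # (timestamp, topic, cell)

    def add(self, ts, topic, lat, lon):
        self.events.append((ts, topic, cell_of(lat, lon)))

    def coverage(self, now):
        """Cells covered by each topic within the last `window` time units."""
        cov = defaultdict(set)
        for ts, topic, cell in self.events:
            if now - ts <= self.window:
                cov[topic].add(cell)
        return cov
```

Because posts are reduced to cell identifiers on arrival, the coverage of a topic is just a set of cells — cheap to update and compare across window slides, which is the efficiency argument the abstract makes.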
Citations: 2
SPECTRA: Continuous Query Processing for RDF Graph Streams Over Sliding Windows
Syed Gillani, Gauthier Picard, F. Laforest
This paper proposes a new approach for the incremental evaluation of RDF graph streams over sliding windows. Our system, called "SPECTRA", combines a novel form of RDF graph summarisation, a new incremental evaluation method, and adaptive indexing techniques. We materialise the summarised graph from each event using vertically partitioned views to facilitate fast hash-joins for all types of queries. Our incremental and adaptive indexing is a byproduct of query processing, and thus provides considerable advantages over offline and online indexing. Furthermore, contrary to existing approaches, we employ incremental evaluation of triples within a window. This results in a considerable reduction in response time, while cutting the unnecessary cost that recomputation models impose for each triple insertion and eviction within a defined window. We show that our resulting system is able to cope with complex queries and datasets with clear benefits.
{"title":"SPECTRA: Continuous Query Processing for RDF Graph Streams Over Sliding Windows","authors":"Syed Gillani, Gauthier Picard, F. Laforest","doi":"10.1145/2949689.2949701","DOIUrl":"https://doi.org/10.1145/2949689.2949701","url":null,"abstract":"This paper proposes a new approach for the the incremental evaluation of RDF graph streams over sliding windows. Our system, called \"SPECTRA\", combines a novel formof RDF graph summarisation, a new incremental evaluation method and adaptive indexing techniques. We materialise the summarised graph from each event using vertically partitioned views to facilitate the fast hash-joins for all types of queries. Our incremental and adaptive indexing is a byproduct of query processing, and thus provides considerable advantages over offline and online indexing. Furthermore, contrary to the existing approaches, we employ incremental evaluation of triples within a window. This results in considerable reduction in response time, while cutting the unnecessary cost imposed by recomputation models for each triple insertion and eviction within a defined window. We show that our resulting system is able to cope with complex queries and datasets with clear benefits. 
Our experimental results on both synthetic and real-world datasets show up to an order of magnitude of performance improvements as compared to state-of-the-art systems.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129190894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
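A rough illustration of the incremental-evaluation idea over a windowed triple stream: instead of re-running a query over the whole window after every insertion, each arriving triple is joined only against the indexed window contents, and only the new results it produces are emitted. This toy two-pattern join is our own sketch, not SPECTRA's summarisation, partitioned views, or adaptive indexing; all identifiers are ours.

```python
from collections import defaultdict, deque

class WindowedTripleStore:
    """Sliding-window store of (s, p, o) triples with a per-predicate
    index, supporting incremental evaluation of the fixed two-pattern
    join (?x p1 ?y) . (?y p2 ?z)."""

    def __init__(self, window):
        self.window = window
        self.buf = deque()  # (ts, s, p, o) in arrival order
        # predicate -> subject -> set of objects
        self.by_pred_s = defaultdict(lambda: defaultdict(set))

    def _evict(self, now):
        """Drop triples that have fallen out of the sliding window."""
        while self.buf and now - self.buf[0][0] > self.window:
            ts, s, p, o = self.buf.popleft()
            self.by_pred_s[p][s].discard(o)

    def insert(self, now, s, p, o, p1, p2):
        """Add a triple; return only the NEW join results it produces."""
        self._evict(now)
        new = []
        if p == p1:
            # new left-hand triple: its object joins existing p2 subjects
            for z in self.by_pred_s[p2].get(o, ()):
                new.append((s, o, z))
        if p == p2:
            # new right-hand triple: its subject joins existing p1 objects
            for x, objs in self.by_pred_s[p1].items():
                if s in objs:
                    new.append((x, s, o))
        self.buf.append((now, s, p, o))
        self.by_pred_s[p][s].add(o)
        return new
```

The key property — each insertion or eviction touches only the affected index entries, never the whole window — is the cost the abstract contrasts with recomputation models.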
Citations: 4
Journal: Proceedings of the 28th International Conference on Scientific and Statistical Database Management