Proceedings of the 2016 International Conference on Management of Data最新文献_第7页

Adaptive Data Skipping in Main-Memory Systems 主存系统中的自适应数据跳变

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2914836

Wilson Qin, Stratos Idreos

As modern main-memory optimized data systems increasingly rely on fast scans, lightweight indexes that allow for data skipping play a crucial role in data filtering to reduce system I/O. Scans benefit from data skipping when the data order is sorted, semi-sorted, or comprised of clustered values. However data skipping loses effectiveness over arbitrary data distributions. Applying data skipping techniques over non-sorted data can significantly decrease query performance since the extra cost of metadata reads result in no corresponding scan performance gains. We introduce adaptive data skipping as a framework for structures and techniques that respond to a vast array of data distributions and query workloads. We reveal an adaptive zonemaps design and implementation on a main-memory column store prototype to demonstrate that adaptive data skipping has potential for 1.4X speedup.

由于现代主存优化的数据系统越来越依赖于快速扫描，允许数据跳过的轻量级索引在数据过滤中起着至关重要的作用，可以减少系统I/O。当数据顺序排序、半排序或由聚集值组成时，扫描受益于数据跳过。然而，数据跳变在任意数据分布中失去了有效性。在未排序的数据上应用数据跳过技术会显著降低查询性能，因为元数据读取的额外成本不会带来相应的扫描性能提升。我们将自适应数据跳转作为响应大量数据分布和查询工作负载的结构和技术框架引入。我们在一个主存列存储原型上展示了一个自适应区域地图的设计和实现，以证明自适应数据跳转具有1.4倍加速的潜力。

引用次数: 17

Towards a Non-2PC Transaction Management in Distributed Database Systems 面向分布式数据库系统的非2pc事务管理

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2882923

Qian Lin, Pengfei Chang, Gang Chen, B. Ooi, K. Tan, Zhengkui Wang

Shared-nothing architecture has been widely used in distributed databases to achieve good scalability. While it offers superior performance for local transactions, the overhead of processing distributed transactions can degrade the system performance significantly. The key contributor to the degradation is the expensive two-phase commit (2PC) protocol used to ensure atomic commitment of distributed transactions. In this paper, we propose a transaction management scheme called LEAP to avoid the 2PC protocol within distributed transaction processing. Instead of processing a distributed transaction across multiple nodes, LEAP converts the distributed transaction into a local transaction. This benefits the processing locality and facilitates adaptive data repartitioning when there is a change in data access pattern. Based on LEAP, we develop an online transaction processing (OLTP) system, L-Store, and compare it with the state-of-the-art distributed in-memory OLTP system, H-Store, which relies on the 2PC protocol for distributed transaction processing, and H^L-Store, a H-Store that has been modified to make use of LEAP. Results of an extensive experimental evaluation show that our LEAP-based engines are superior over H-Store by a wide margin, especially for workloads that exhibit locality-based data accesses.

无共享架构在分布式数据库中得到了广泛的应用，以获得良好的可扩展性。虽然它为本地事务提供了优越的性能，但处理分布式事务的开销可能会显著降低系统性能。导致性能下降的关键因素是昂贵的两阶段提交(2PC)协议，该协议用于确保分布式事务的原子提交。为了避免分布式事务处理中的2PC协议，本文提出了一种称为LEAP的事务管理方案。LEAP不是跨多个节点处理分布式事务，而是将分布式事务转换为本地事务。这有利于处理局部性，并便于在数据访问模式发生变化时进行自适应数据重分区。基于LEAP，我们开发了一个在线事务处理(OLTP)系统L-Store，并将其与最先进的分布式内存OLTP系统H- store和H^L-Store进行了比较。H- store依赖于2PC协议进行分布式事务处理，H^L-Store是一个经过修改以利用LEAP的H- store。广泛的实验评估结果表明，我们基于leap的引擎在很大程度上优于H-Store，特别是对于显示基于位置的数据访问的工作负载。

{"title":"Towards a Non-2PC Transaction Management in Distributed Database Systems","authors":"Qian Lin, Pengfei Chang, Gang Chen, B. Ooi, K. Tan, Zhengkui Wang","doi":"10.1145/2882903.2882923","DOIUrl":"https://doi.org/10.1145/2882903.2882923","url":null,"abstract":"Shared-nothing architecture has been widely used in distributed databases to achieve good scalability. While it offers superior performance for local transactions, the overhead of processing distributed transactions can degrade the system performance significantly. The key contributor to the degradation is the expensive two-phase commit (2PC) protocol used to ensure atomic commitment of distributed transactions. In this paper, we propose a transaction management scheme called LEAP to avoid the 2PC protocol within distributed transaction processing. Instead of processing a distributed transaction across multiple nodes, LEAP converts the distributed transaction into a local transaction. This benefits the processing locality and facilitates adaptive data repartitioning when there is a change in data access pattern. Based on LEAP, we develop an online transaction processing (OLTP) system, L-Store, and compare it with the state-of-the-art distributed in-memory OLTP system, H-Store, which relies on the 2PC protocol for distributed transaction processing, and H^L-Store, a H-Store that has been modified to make use of LEAP. Results of an extensive experimental evaluation show that our LEAP-based engines are superior over H-Store by a wide margin, especially for workloads that exhibit locality-based data accesses.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86541578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 53

Local Similarity Search for Unstructured Text 非结构化文本的局部相似度搜索

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915211

Pei Wang, Chuan Xiao, Jianbin Qin, Wei Wang, Xiaoyan Zhang, Y. Ishikawa

With the growing popularity of electronic documents, replication can occur for many reasons. People may copy text segments from various sources and make modifications. In this paper, we study the problem of local similarity search to find partially replicated text. Unlike existing studies on similarity search which find entirely duplicated documents, our target is to identify documents that approximately share a pair of sliding windows which differ by no more than τ tokens. Our problem is technically challenging because for sliding windows the tokens to be indexed are less selective than entire documents, rendering set similarity join-based algorithms less efficient. Our proposed method is based on enumerating token combinations to obtain signatures with high selectivity. In order to strike a balance between signature and candidate generation, we partition the token universe and for different partitions we generate combinations composed of different numbers of tokens. A cost-aware algorithm is devised to find a good partitioning of the token universe. We also propose to leverage the overlap between adjacent windows to share computation and thus speed up query processing. In addition, we develop the techniques to support the large thresholds. Experiments on real datasets demonstrate the efficiency of our method against alternative solutions.

随着电子文档的日益普及，复制的发生有很多原因。人们可以从各种来源复制文本片段并进行修改。在本文中，我们研究了局部相似搜索的问题，以找到部分复制的文本。与现有的相似性搜索研究发现完全重复的文档不同，我们的目标是识别大约共享一对滑动窗口的文档，其差异不超过τ个令牌。我们的问题在技术上是具有挑战性的，因为对于滑动窗口，要索引的令牌比整个文档的选择性要低，使得基于集合相似度连接的算法效率较低。我们提出的方法是基于枚举令牌组合来获得高选择性的签名。为了在签名和候选生成之间取得平衡，我们对令牌域进行分区，对于不同的分区，我们生成由不同数量的令牌组成的组合。设计了一种成本感知算法来找到令牌域的良好划分。我们还建议利用相邻窗口之间的重叠来共享计算，从而加快查询处理。此外，我们还开发了支持大阈值的技术。在真实数据集上的实验证明了我们的方法对替代解决方案的有效性。

{"title":"Local Similarity Search for Unstructured Text","authors":"Pei Wang, Chuan Xiao, Jianbin Qin, Wei Wang, Xiaoyan Zhang, Y. Ishikawa","doi":"10.1145/2882903.2915211","DOIUrl":"https://doi.org/10.1145/2882903.2915211","url":null,"abstract":"With the growing popularity of electronic documents, replication can occur for many reasons. People may copy text segments from various sources and make modifications. In this paper, we study the problem of local similarity search to find partially replicated text. Unlike existing studies on similarity search which find entirely duplicated documents, our target is to identify documents that approximately share a pair of sliding windows which differ by no more than τ tokens. Our problem is technically challenging because for sliding windows the tokens to be indexed are less selective than entire documents, rendering set similarity join-based algorithms less efficient. Our proposed method is based on enumerating token combinations to obtain signatures with high selectivity. In order to strike a balance between signature and candidate generation, we partition the token universe and for different partitions we generate combinations composed of different numbers of tokens. A cost-aware algorithm is devised to find a good partitioning of the token universe. We also propose to leverage the overlap between adjacent windows to share computation and thus speed up query processing. In addition, we develop the techniques to support the large thresholds. Experiments on real datasets demonstrate the efficiency of our method against alternative solutions.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76357666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 25

Scaling Multicore Databases via Constrained Parallel Execution 通过约束并行执行扩展多核数据库

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2882934

Zhaoguo Wang, Shuai Mu, Yang Cui, Han Yi, Haibo Chen, Jinyang Li

Multicore in-memory databases often rely on traditional con- currency control schemes such as two-phase-locking (2PL) or optimistic concurrency control (OCC). Unfortunately, when the workload exhibits a non-trivial amount of contention, both 2PL and OCC sacrifice much parallel execution op- portunity. In this paper, we describe a new concurrency control scheme, interleaving constrained concurrency con- trol (IC3), which provides serializability while allowing for parallel execution of certain conflicting transactions. IC3 combines the static analysis of the transaction workload with runtime techniques that track and enforce dependencies among concurrent transactions. The use of static analysis simplifies IC3's runtime design, allowing it to scale to many cores. Evaluations on a 64-core machine using the TPC- C benchmark show that IC3 outperforms traditional con- currency control schemes under contention. It achieves the throughput of 434K transactions/sec on the TPC-C bench- mark configured with only one warehouse. It also scales better than several recent concurrent control schemes that also target contended workloads.

多核内存数据库通常依赖于传统的虚拟货币控制方案，如两阶段锁定(2PL)或乐观并发控制(OCC)。不幸的是，当工作负载显示出大量的争用时，2PL和OCC都牺牲了许多并行执行机会。在本文中，我们描述了一种新的并发控制方案，交错约束并发控制(IC3)，它提供了串行性，同时允许并行执行某些冲突事务。IC3将事务工作负载的静态分析与跟踪和执行并发事务之间的依赖关系的运行时技术相结合。静态分析的使用简化了IC3的运行时设计，允许它扩展到多个内核。在64核机器上使用TPC- C基准测试的评估表明，IC3在争用下优于传统的代币控制方案。它在仅配置一个仓库的TPC-C基准测试上实现了434K事务/秒的吞吐量。它的可伸缩性也优于最近的几种针对竞争工作负载的并发控制方案。

{"title":"Scaling Multicore Databases via Constrained Parallel Execution","authors":"Zhaoguo Wang, Shuai Mu, Yang Cui, Han Yi, Haibo Chen, Jinyang Li","doi":"10.1145/2882903.2882934","DOIUrl":"https://doi.org/10.1145/2882903.2882934","url":null,"abstract":"Multicore in-memory databases often rely on traditional con- currency control schemes such as two-phase-locking (2PL) or optimistic concurrency control (OCC). Unfortunately, when the workload exhibits a non-trivial amount of contention, both 2PL and OCC sacrifice much parallel execution op- portunity. In this paper, we describe a new concurrency control scheme, interleaving constrained concurrency con- trol (IC3), which provides serializability while allowing for parallel execution of certain conflicting transactions. IC3 combines the static analysis of the transaction workload with runtime techniques that track and enforce dependencies among concurrent transactions. The use of static analysis simplifies IC3's runtime design, allowing it to scale to many cores. Evaluations on a 64-core machine using the TPC- C benchmark show that IC3 outperforms traditional con- currency control schemes under contention. It achieves the throughput of 434K transactions/sec on the TPC-C bench- mark configured with only one warehouse. It also scales better than several recent concurrent control schemes that also target contended workloads.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79574668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 60

DBSherlock: A Performance Diagnostic Tool for Transactional Databases DBSherlock:事务性数据库的性能诊断工具

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915218

Dong Young Yoon, Ning Niu, Barzan Mozafari

Running an online transaction processing (OLTP) system is one of the most daunting tasks required of database administrators (DBAs). As businesses rely on OLTP databases to support their mission-critical and real-time applications, poor database performance directly impacts their revenue and user experience. As a result, DBAs constantly monitor, diagnose, and rectify any performance decays. Unfortunately, the manual process of debugging and diagnosing OLTP performance problems is extremely tedious and non-trivial. Rather than being caused by a single slow query, performance problems in OLTP databases are often due to a large number of concurrent and competing transactions adding up to compounded, non-linear effects that are difficult to isolate. Sudden changes in request volume, transactional patterns, network traffic, or data distribution can cause previously abundant resources to become scarce, and the performance to plummet. This paper presents a practical tool for assisting DBAs in quickly and reliably diagnosing performance problems in an OLTP database. By analyzing hundreds of statistics and configurations collected over the lifetime of the system, our algorithm quickly identifies a small set of potential causes and presents them to the DBA. The root-cause established by the DBA is reincorporated into our algorithm as a new causal model to improve future diagnoses. Our experiments show that this algorithm is substantially more accurate than the state-of-the-art algorithm in finding correct explanations.

运行在线事务处理(OLTP)系统是数据库管理员(dba)需要完成的最艰巨的任务之一。由于企业依赖OLTP数据库来支持其关键任务和实时应用程序，因此较差的数据库性能会直接影响其收入和用户体验。因此，dba不断地监视、诊断和纠正任何性能下降。不幸的是，调试和诊断OLTP性能问题的手动过程非常繁琐和重要。OLTP数据库中的性能问题通常不是由单个缓慢的查询引起的，而是由于大量并发和竞争事务叠加在一起造成了难以隔离的复合非线性影响。请求量、事务模式、网络流量或数据分布的突然变化可能导致以前丰富的资源变得稀缺，性能急剧下降。本文提供了一个实用的工具，可以帮助dba快速可靠地诊断OLTP数据库中的性能问题。通过分析在系统生命周期内收集的数百个统计信息和配置，我们的算法可以快速识别一小部分潜在原因，并将其呈现给DBA。DBA建立的根本原因被重新纳入我们的算法，作为一个新的因果模型，以提高未来的诊断。我们的实验表明，该算法在寻找正确的解释方面比最先进的算法要准确得多。

{"title":"DBSherlock: A Performance Diagnostic Tool for Transactional Databases","authors":"Dong Young Yoon, Ning Niu, Barzan Mozafari","doi":"10.1145/2882903.2915218","DOIUrl":"https://doi.org/10.1145/2882903.2915218","url":null,"abstract":"Running an online transaction processing (OLTP) system is one of the most daunting tasks required of database administrators (DBAs). As businesses rely on OLTP databases to support their mission-critical and real-time applications, poor database performance directly impacts their revenue and user experience. As a result, DBAs constantly monitor, diagnose, and rectify any performance decays. Unfortunately, the manual process of debugging and diagnosing OLTP performance problems is extremely tedious and non-trivial. Rather than being caused by a single slow query, performance problems in OLTP databases are often due to a large number of concurrent and competing transactions adding up to compounded, non-linear effects that are difficult to isolate. Sudden changes in request volume, transactional patterns, network traffic, or data distribution can cause previously abundant resources to become scarce, and the performance to plummet. This paper presents a practical tool for assisting DBAs in quickly and reliably diagnosing performance problems in an OLTP database. By analyzing hundreds of statistics and configurations collected over the lifetime of the system, our algorithm quickly identifies a small set of potential causes and presents them to the DBA. The root-cause established by the DBA is reincorporated into our algorithm as a new causal model to improve future diagnoses. Our experiments show that this algorithm is substantially more accurate than the state-of-the-art algorithm in finding correct explanations.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"196 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79882225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 65

Wander Join: Online Aggregation for Joins Wander Join:连接的在线聚合

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899413

Feifei Li, Bin Wu, K. Yi, Zhuoyue Zhao

Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and may also need very restrictive assumptions (e.g., tuples in a table are stored in random order). We introduce a new approach, wander join, to the online aggregation problem by performing random walks over the underlying join graph. We have also implemented and tested wander join in the latest PostgreSQL.

连接是昂贵的，在线聚合取代连接是为了降低成本而提出的，它以连续的在线方式在查询效率和准确性之间提供了一个很好的灵活的权衡。然而，在内部和外部内存中，最先进的方法是基于波纹连接，这仍然非常昂贵，并且可能还需要非常严格的假设(例如，表中的元组以随机顺序存储)。我们引入了一种新的方法，漫游连接，通过在底层连接图上执行随机漫步来解决在线聚合问题。我们还在最新的PostgreSQL中实现并测试了wander join。

引用次数: 9

Top-k Relevant Semantic Place Retrieval on Spatial RDF Data 空间RDF数据Top-k相关语义位置检索

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2882941

Jieming Shi, Dingming Wu, N. Mamoulis

RDF data are traditionally accessed using structured query languages, such as SPARQL. However, this requires users to understand the language as well as the RDF schema. Keyword search on RDF data aims at relieving the user from these requirements; the user only inputs a set of keywords and the goal is to find small RDF subgraphs which contain all keywords. At the same time, popular RDF knowledge bases also include spatial semantics, which opens the road to location-based search operations. In this work, we propose and study a novel location-based keyword search query on RDF data. The objective of top-k relevant semantic places (kSP) retrieval is to find RDF subgraphs which contain the query keywords and are rooted at spatial entities close to the query location. The novelty of kSP queries is that they are location-aware and that they do not rely on the use of structured query languages. We design a basic method for the processing of kSP queries. To further accelerate kSP retrieval, two pruning approaches and a data preprocessing technique are proposed. Extensive empirical studies on two real datasets demonstrate the superior and robust performance of our proposals compared to the basic method.

传统上使用结构化查询语言(如SPARQL)访问RDF数据。但是，这需要用户理解该语言以及RDF模式。RDF数据的关键字搜索就是为了使用户从这些需求中解脱出来;用户只输入一组关键字，目标是找到包含所有关键字的RDF子图。同时，流行的RDF知识库还包括空间语义，这为基于位置的搜索操作开辟了道路。在这项工作中，我们提出并研究了一种新的基于位置的RDF关键字搜索查询。top-k相关语义位置(kSP)检索的目标是查找包含查询关键字的RDF子图，这些子图植根于靠近查询位置的空间实体。kSP查询的新颖之处在于它们是位置感知的，并且不依赖于结构化查询语言的使用。我们设计了一个处理kSP查询的基本方法。为了进一步加快kSP检索速度，提出了两种剪枝方法和一种数据预处理技术。在两个真实数据集上的大量实证研究表明，与基本方法相比，我们的建议具有优越的鲁棒性。

{"title":"Top-k Relevant Semantic Place Retrieval on Spatial RDF Data","authors":"Jieming Shi, Dingming Wu, N. Mamoulis","doi":"10.1145/2882903.2882941","DOIUrl":"https://doi.org/10.1145/2882903.2882941","url":null,"abstract":"RDF data are traditionally accessed using structured query languages, such as SPARQL. However, this requires users to understand the language as well as the RDF schema. Keyword search on RDF data aims at relieving the user from these requirements; the user only inputs a set of keywords and the goal is to find small RDF subgraphs which contain all keywords. At the same time, popular RDF knowledge bases also include spatial semantics, which opens the road to location-based search operations. In this work, we propose and study a novel location-based keyword search query on RDF data. The objective of top-k relevant semantic places (kSP) retrieval is to find RDF subgraphs which contain the query keywords and are rooted at spatial entities close to the query location. The novelty of kSP queries is that they are location-aware and that they do not rely on the use of structured query languages. We design a basic method for the processing of kSP queries. To further accelerate kSP retrieval, two pruning approaches and a data preprocessing technique are proposed. Extensive empirical studies on two real datasets demonstrate the superior and robust performance of our proposals compared to the basic method.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84330807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 27

FERARI: A Prototype for Complex Event Processing over Streaming Multi-cloud Platforms FERARI:流多云平台上复杂事件处理的原型

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899395

Ioannis Flouris, Vasiliki Manikaki, Nikos Giatrakos, Antonios Deligiannakis, M. Garofalakis, M. Mock, Sebastian Bothe, Inna Skarbovsky, Fabiana Fournier, Marko Stajcer, Tomislav Krizan, Jonathan Yom-Tov, Taji Curin

In this demo, we present FERARI, a prototype that enables real-time Complex Event Processing (CEP) for large volume event data streams over distributed topologies. Our prototype constitutes, to our knowledge, the first complete, multi-cloud based end-to-end CEP solution incorporating: a) a user-friendly, web-based query authoring tool, (b) a powerful CEP engine implemented on top of a streaming cloud platform, (c) a CEP optimizer that chooses the best query execution plan with respect to low latency and/or reduced inter-cloud communication burden, and (d) a query analytics dashboard encompassing graph and map visualization tools to provide a holistic picture with respect to the detected complex events to final stakeholders. As a proof-of-concept, we apply FERARI to enable mobile fraud detection over real, properly anonymized, telecommunication data from T-Hrvatski Telekom network in Croatia.

在这个演示中，我们展示了FERARI，这是一个原型，可以对分布式拓扑上的大容量事件数据流进行实时复杂事件处理(CEP)。据我们所知，我们的原型构成了第一个完整的、基于多云的端到端CEP解决方案，包括:a)用户友好的、基于web的查询编写工具，(b)在流云平台上实现的强大的CEP引擎，(c)选择最佳查询执行计划的CEP优化器，考虑到低延迟和/或减少云间通信负担，以及(d)包含图形和地图可视化工具的查询分析仪表板，为最终利益相关者提供有关检测到的复杂事件的整体图片。作为概念验证，我们应用FERARI对来自克罗地亚T-Hrvatski电信网络的真实、适当匿名的电信数据进行移动欺诈检测。

引用次数: 23

Semistructured Models, Queries and Algebras in the Big Data Era: Tutorial Summary 大数据时代的半结构化模型、查询和代数:教程摘要

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2912573

Y. Papakonstantinou

Numerous databases promoted as SQL-on-Hadoop, NewSQL and NoSQL support semi-structured, schemaless and heterogeneous data, typically in the form of enriched JSON. They also provide corresponding query languages. In addition to these genuine JSON databases, relational databases also provide special functions and language features for the support of JSON columns, typically piggybacking on non-1NF (non first normal form) features that SQL acquired over the years. We refer to SQL databases with JSON support as SQL/JSON databases. The evolving query languages present multiple variations: Some are superficial syntactic ones, while other ones are genuine differences in modeling, language capabilities and semantics. Incompatibility with SQL presents a learning challenge for genuine JSON databases, while the table orientation of SQL/JSON databases often leads to cumbersome syntactic/semantic structures that are contrary to the semistructured nature of JSON. Furthermore, the query languages often fall short of full-fledged semistructured query language capabilities, when compared to the yardstick set by XQuery and prior works on semistructured data (even after superficial model differences are abstracted out). We survey features, the designers' options and differences in the approaches taken by actual systems. In particular, we first present a SQL backwards-compatible language, named SQL++, which can access both SQL and JSON data. SQL++ is expected to be supported by Couchbase's CouchDB and UCI's AsterixDB semistructured databases. Then we expand SQL++ into the Configurable SQL++, whereas multiple possible (and different) semantics are formally captured by the multiple options that the language's semantic configuration options can take. We show how appropriate setting of the configuration options morphs the Configurable SQL++ semantics into the semantics of 10 surveyed languages, hence providing a compact and formal tool to understand the essential semantic differences between different systems. We briefly comment on the utility of formally capturing semantic variations in polystore systems. Finally we discuss the comparison with prior nested and semistructured query languages (notably OQL and XQuery) and describe a key aspect of query processor implementation: set-oriented semistructured query algebras. In particular, we transfer into the JSON era lessons from the semistructured query processing research of the 90s and 00s and combine them with insights on current JSON databases. Again, the tutorial presents the algebras' fundamentals while it abstracts away modeling differences that are not applicable.

许多被推广为SQL-on-Hadoop、NewSQL和NoSQL的数据库支持半结构化、无模式和异构数据，通常以丰富的JSON形式出现。它们还提供相应的查询语言。除了这些真正的JSON数据库之外，关系数据库还为支持JSON列提供了特殊的函数和语言特性，通常附带SQL多年来获得的非1nf(非第一范式)特性。我们将支持JSON的SQL数据库称为SQL/JSON数据库。不断发展的查询语言呈现出多种变化:一些是表面的语法变化，而另一些则是在建模、语言能力和语义方面的真正差异。与SQL的不兼容性对真正的JSON数据库提出了一个学习挑战，而SQL/JSON数据库的面向表通常会导致繁琐的语法/语义结构，这与JSON的半结构化性质相反。此外，与XQuery设置的标准和以前在半结构化数据上的工作相比(即使抽象出表面的模型差异)，查询语言往往缺乏成熟的半结构化查询语言功能。我们调查的特点，设计师的选择和实际系统所采取的方法的差异。特别地，我们首先提出了一种SQL向后兼容的语言，名为SQL++，它可以访问SQL和JSON数据。SQL++有望得到Couchbase的CouchDB和UCI的AsterixDB半结构化数据库的支持。然后我们将SQL++扩展为可配置的SQL++，而语言的语义配置选项可以采用的多个选项正式捕获多个可能的(和不同的)语义。我们展示了配置选项的适当设置如何将可配置的SQL++语义转换为10种调查语言的语义，从而提供了一个紧凑而正式的工具来理解不同系统之间的基本语义差异。我们简要地评论了在多存储系统中正式捕获语义变化的效用。最后，我们讨论了与先前嵌套和半结构化查询语言(特别是OQL和XQuery)的比较，并描述了查询处理器实现的一个关键方面:面向集合的半结构化查询代数。特别地，我们从90年代和00年代的半结构化查询处理研究中吸取了JSON时代的经验教训，并将它们与对当前JSON数据库的见解结合起来。同样，本教程介绍了代数的基础知识，同时抽象了不适用的建模差异。

{"title":"Semistructured Models, Queries and Algebras in the Big Data Era: Tutorial Summary","authors":"Y. Papakonstantinou","doi":"10.1145/2882903.2912573","DOIUrl":"https://doi.org/10.1145/2882903.2912573","url":null,"abstract":"Numerous databases promoted as SQL-on-Hadoop, NewSQL and NoSQL support semi-structured, schemaless and heterogeneous data, typically in the form of enriched JSON. They also provide corresponding query languages. In addition to these genuine JSON databases, relational databases also provide special functions and language features for the support of JSON columns, typically piggybacking on non-1NF (non first normal form) features that SQL acquired over the years. We refer to SQL databases with JSON support as SQL/JSON databases. The evolving query languages present multiple variations: Some are superficial syntactic ones, while other ones are genuine differences in modeling, language capabilities and semantics. Incompatibility with SQL presents a learning challenge for genuine JSON databases, while the table orientation of SQL/JSON databases often leads to cumbersome syntactic/semantic structures that are contrary to the semistructured nature of JSON. Furthermore, the query languages often fall short of full-fledged semistructured query language capabilities, when compared to the yardstick set by XQuery and prior works on semistructured data (even after superficial model differences are abstracted out). We survey features, the designers' options and differences in the approaches taken by actual systems. In particular, we first present a SQL backwards-compatible language, named SQL++, which can access both SQL and JSON data. SQL++ is expected to be supported by Couchbase's CouchDB and UCI's AsterixDB semistructured databases. Then we expand SQL++ into the Configurable SQL++, whereas multiple possible (and different) semantics are formally captured by the multiple options that the language's semantic configuration options can take. We show how appropriate setting of the configuration options morphs the Configurable SQL++ semantics into the semantics of 10 surveyed languages, hence providing a compact and formal tool to understand the essential semantic differences between different systems. We briefly comment on the utility of formally capturing semantic variations in polystore systems. Finally we discuss the comparison with prior nested and semistructured query languages (notably OQL and XQuery) and describe a key aspect of query processor implementation: set-oriented semistructured query algebras. In particular, we transfer into the JSON era lessons from the semistructured query processing research of the 90s and 00s and combine them with insights on current JSON databases. Again, the tutorial presents the algebras' fundamentals while it abstracts away modeling differences that are not applicable.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85643230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Constance: An Intelligent Data Lake System 康斯坦斯:智能数据湖系统

Proceedings of the 2016 International Conference on Management of Data

Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899389

Rihan Hai, Sandra Geisler, C. Quix

As the challenge of our time, Big Data still has many research hassles, especially the variety of data. The high diversity of data sources often results in information silos, a collection of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this problem, by providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management, would only lead to a 'data swamp'. To avoid this, we propose Constance, a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured data and semi-structured data, Constance provides users a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance. Constance will be applied to two real-life use cases in order to show attendees the importance and usefulness of our generic and extensible data lake system.

作为我们时代的挑战，大数据的研究仍然存在许多问题，尤其是数据的多样性。数据源的高度多样性通常会导致信息孤岛，这是一组具有异构模式、查询语言和api的非集成数据管理系统。数据湖系统已经被提出作为解决这个问题的方案，它为原始数据提供一个无模式的存储库，并提供一个通用的访问接口。然而，仅仅将所有数据倾倒到数据湖中而不进行任何元数据管理，只会导致“数据沼泽”。为了避免这种情况，我们提出Constance，这是一个数据湖系统，对从异构数据源提取的原始数据进行了复杂的元数据管理。Constance从数据源中发现、提取和总结结构化元数据，并用语义信息对数据和元数据进行标注，避免歧义。通过支持结构化数据和半结构化数据的嵌入式查询重写引擎，Constance为用户提供了查询处理和数据探索的统一界面。在演示过程中，我们将介绍Constance的每个功能组件。Constance将应用于两个现实生活中的用例，以向与会者展示我们的通用和可扩展数据湖系统的重要性和有用性。

{"title":"Constance: An Intelligent Data Lake System","authors":"Rihan Hai, Sandra Geisler, C. Quix","doi":"10.1145/2882903.2899389","DOIUrl":"https://doi.org/10.1145/2882903.2899389","url":null,"abstract":"As the challenge of our time, Big Data still has many research hassles, especially the variety of data. The high diversity of data sources often results in information silos, a collection of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this problem, by providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management, would only lead to a 'data swamp'. To avoid this, we propose Constance, a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured data and semi-structured data, Constance provides users a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance. Constance will be applied to two real-life use cases in order to show attendees the importance and usefulness of our generic and extensible data lake system.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79424998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 199