Proceedings. ACM-SIGMOD International Conference on Management of Data: Latest Articles

Query processing on prefix trees live

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463682

T. Kissinger, B. Schlegel, Dirk Habich, Wolfgang Lehner

Modern database systems have to process huge amounts of data and should provide results with low latency at the same time. To achieve this, data is nowadays typically held completely in main memory to benefit from its high bandwidth and low access latency, which could never be reached with disks. Current in-memory databases are usually column-stores that exchange columns or vectors between operators and suffer from a high tuple reconstruction overhead. In this demonstration proposal, we present DexterDB, which implements our novel prefix tree-based processing model that makes indexes first-class citizens of the database system. The core idea is that each operator takes a set of indexes as input and builds a new index as output that is indexed on the attribute requested by the subsequent operator. With that, we are able to build composed operators, like the multi-way-select-join-group. Such operators speed up the processing of complex OLAP queries so that DexterDB outperforms state-of-the-art in-memory databases. Our demonstration focuses on the different optimization options for such query plans. Hence, we built an interactive GUI that connects to a DexterDB instance and allows the manipulation of query optimization parameters. The generated query plans and important execution statistics are visualized to help visitors understand our processing model.
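
The index-in, index-out idea described above can be illustrated with a toy sketch. The Python below uses plain dictionaries as stand-ins for prefix trees; the function names `build_index` and `select_join_group`, the row layout, and the sample data are all assumptions made for illustration and are not DexterDB's API.

```python
from collections import defaultdict


def build_index(rows, key):
    """Index a list of dict rows on `key` (a dict stands in for a prefix tree)."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def select_join_group(left_idx, right_idx, predicate, group_key, agg_attr):
    """Hypothetical composed operator: join two indexes on their shared key,
    filter the joined tuples, and emit a new index keyed on `group_key`
    that holds the sum of `agg_attr` per group."""
    out = defaultdict(int)
    for key, left_rows in left_idx.items():
        for left in left_rows:
            for right in right_idx.get(key, []):
                joined = {**left, **right}
                if predicate(joined):
                    out[joined[group_key]] += joined[agg_attr]
    return dict(out)


# Made-up sample data.
orders = [
    {"cust_id": 1, "amount": 30},
    {"cust_id": 1, "amount": 5},
    {"cust_id": 2, "amount": 20},
]
customers = [
    {"cust_id": 1, "region": "EU"},
    {"cust_id": 2, "region": "US"},
]

orders_idx = build_index(orders, "cust_id")
customers_idx = build_index(customers, "cust_id")

# The output is itself an index, keyed on the attribute the next operator needs.
result = select_join_group(
    orders_idx, customers_idx, lambda t: t["amount"] >= 10, "region", "amount"
)
```

The point of the sketch is only that the output is again an index, keyed on the attribute a following operator would look up, so no intermediate tuples need to be reconstructed between the three logical steps.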

Citations: 3

Reverse engineering complex join queries

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465320

Meihui Zhang, Hazem Elmeleegy, Cecilia M. Procopiuc, D. Srivastava

We study the following problem: Given a database D with schema G and an output table Out, compute a join query Q that generates Out from D. A simpler variant allows Q to return a superset of Out. This problem has numerous applications, both by itself and as a building block for other problems. Related prior work imposes conditions on the structure of Q that simplify computation but are not always consistent with the application. We discuss several natural SQL queries that do not satisfy these conditions and cannot be discovered by prior work. In this paper, we propose an efficient algorithm that discovers queries with arbitrary join graphs. A crucial insight is that any graph can be characterized by the combination of a simple structure, called a star, and a series of merge steps over the star. The merge steps define a lattice over graphs derived from the same star. This allows us to explore the set of candidate solutions in a principled way and quickly prune out a large number of infeasible graphs. We also design several optimizations that significantly reduce the running time. Finally, we conduct an extensive experimental study over a benchmark database and show that our approach is scalable and accurately discovers complex join queries.

Citations: 76

Value invention in data exchange

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465311

Patricia C. Arocena, Boris Glavic, Renée J. Miller

The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple generating dependencies (SO tgds) has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the Nash, Bernstein and Melnik unskolemization algorithm by understanding when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches, but more importantly we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark and illustrates the applicability of our results on real mappings.
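
As a rough illustration of value invention with a Skolem function, the sketch below invents one labeled null per distinct argument tuple, so that the same missing value is represented consistently across target relations. The `Skolem` class, the `exchange` function, the toy mapping from `Emp(name, dept)` to `Dept(id, dept)` and `Works(name, id)`, and the null naming scheme are all assumptions for illustration; they are not taken from the paper.

```python
class Skolem:
    """A Skolem function: the same arguments always yield the same labeled null."""

    def __init__(self, name):
        self.name = name
        self._nulls = {}

    def __call__(self, *args):
        if args not in self._nulls:
            # Invent a fresh labeled null for a new argument tuple.
            self._nulls[args] = f"{self.name}_{len(self._nulls)}"
        return self._nulls[args]


def exchange(emp_rows):
    """Toy mapping: Emp(name, dept) -> Dept(f(dept), dept), Works(name, f(dept)).
    The department id is unknown in the source, so f(dept) invents it."""
    f = Skolem("N")
    dept_rel, works_rel = set(), set()
    for name, dept in emp_rows:
        dept_id = f(dept)
        dept_rel.add((dept_id, dept))
        works_rel.add((name, dept_id))
    return dept_rel, works_rel


dept_rel, works_rel = exchange([("ann", "db"), ("bob", "db"), ("eve", "ml")])
```

Because the Skolem term depends only on `dept`, both employees of the same department share one invented identifier, which is the kind of precise representation of missing information the abstract refers to.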

Citations: 19

An online cost sensitive decision-making method in crowdsourcing systems

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465307

Jinyang Gao, Xuan Liu, B. Ooi, Haixun Wang, Gang Chen

Crowdsourcing has created a variety of opportunities for many challenging problems by leveraging human intelligence. For example, applications such as image tagging, natural language processing, and semantic-based information retrieval can exploit crowd-based human computation to supplement existing computational algorithms. Naturally, human workers in crowdsourcing solve problems based on their knowledge, experience, and perception. It is therefore not clear which problems can be better solved by crowdsourcing than by traditional machine-based methods alone. Therefore, a cost-sensitive quantitative analysis method is needed. In this paper, we design and implement a cost-sensitive method for crowdsourcing. We estimate the profit of the crowdsourcing job online so that questions with no future profit from crowdsourcing can be terminated. Two models are proposed to estimate the profit of a crowdsourcing job, namely the linear value model and the generalized non-linear model. Using these models, the expected profit of obtaining new answers for a specific question is computed based on the answers already received. A question is terminated in real time if the marginal expected profit of obtaining more answers is not positive. We extend the method to publish a batch of questions in a HIT. We evaluate the effectiveness of our proposed method using two real-world jobs on AMT. The experimental results show that our proposed method outperforms all the state-of-the-art methods.

Citations: 55

The power of data use management in action

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465264

P. Upadhyaya, Nick R. Anderson, M. Balazinska, Bill Howe, R. Kaushik, Ravishankar Ramamurthy, Dan Suciu

In this demonstration, we showcase a database management system extended with a new type of component that we call a Data Use Manager (DUM). The DUM enables DBAs to attach policies to data loaded into the DBMS. It then monitors how users query the data, flags potential policy violations, recommends possible fixes, and supports offline analysis of user activities related to data policies. The demonstration uses real healthcare data.

Citations: 3

Mind the gap: large-scale frequent sequence mining

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465285

Iris Miliaraki, K. Berberich, Rainer Gemulla, Spyros Zoupanos

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called "gap constraints", which can be used to limit the output to a controlled set of frequent sequences. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a "projected database" used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the context of text mining suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
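
To make the notion of a gap constraint concrete, here is a minimal single-machine sketch, assuming the common reading that a pattern matches a sequence if at most `max_gap` items lie between consecutive matched items, and that support counts the input sequences containing the pattern. The function names, the parameter name `max_gap`, and the sample database are assumptions for illustration; this is not MG-FSM itself and does no partitioning or MapReduce processing.

```python
def occurs_with_gap(pattern, seq, max_gap):
    """Return True if `pattern` is a subsequence of `seq` with at most
    `max_gap` items between consecutive matched items."""

    def match_from(p_idx, last_pos):
        if p_idx == len(pattern):
            return True
        if last_pos is None:
            # The first pattern item may match anywhere.
            candidates = range(len(seq))
        else:
            # Later items must fall within the allowed gap window.
            stop = min(len(seq), last_pos + max_gap + 2)
            candidates = range(last_pos + 1, stop)
        # Backtracking is needed: a greedy leftmost match can miss valid ones.
        return any(
            seq[i] == pattern[p_idx] and match_from(p_idx + 1, i)
            for i in candidates
        )

    return match_from(0, None)


def support(pattern, database, max_gap):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(occurs_with_gap(pattern, seq, max_gap) for seq in database)


# Made-up sample database of three sequences.
database = [
    ["a", "b", "c"],
    ["a", "x", "b"],
    ["a", "x", "y", "b"],
]
supports = [support(["a", "b"], database, gap) for gap in (0, 1, 2)]
```

Tightening `max_gap` shrinks the set of patterns that reach a given support threshold, which is how a gap constraint limits the output to a controlled set of frequent sequences.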

Citations: 68

Resa: realtime elastic streaming analytics in the cloud

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465343

Tian Tan, Richard T. B. Ma, M. Winslett, Y. Yang, Yong Yu, Zhenjie Zhang

We propose Resa, a novel framework for robust, elastic and realtime stream processing in the cloud. In addition to traditional functionalities of streaming and cloud systems, Resa provides (i) a novel mechanism that handles dynamic additions and removals of nodes in an operator, and (ii) a node re-assignment scheme that minimizes output latency using a queuing model. We have implemented Resa on top of Twitter Storm. Experiments using real data demonstrate the effectiveness and efficiency of Resa.

Citations: 9

ε-Matching: event processing over noisy sequences in real time

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463715

Zheng Li, Tingjian Ge, Cindy X. Chen

Regular expression matching over sequences in real time is a crucial task in complex event processing on data streams. Given that such data sequences are often noisy and errors have temporal and spatial correlations, performing regular expression matching effectively and efficiently is a challenging task. Instead of the traditional approach of learning a distribution of the stream first and then processing queries, we propose a new approach that efficiently does the matching based on an error model. In particular, our algorithms are based on the realistic Markov chain error model, and report all matching paths to trace relevant basic events that trigger the matching. This is much more informative than a single matching path. We also devise algorithms to efficiently return only top-k matching paths, and to handle negations in an extended regular expression. Finally, we conduct a comprehensive experimental study to evaluate our algorithms using real datasets.

Citations: 11

Hekaton: SQL Server's memory-optimized OLTP engine

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463710

C. Diaconu, Craig S. Freedman, Erik Ismert, P. Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, M. Zwilling

Hekaton is a new database engine optimized for memory resident data and OLTP workloads. Hekaton is fully integrated into SQL Server; it is not a separate system. To take advantage of Hekaton, a user simply declares a table memory optimized. Hekaton tables are fully transactional and durable and accessed using T-SQL in the same way as regular SQL Server tables. A query can reference both Hekaton tables and regular tables and a transaction can update data in both types of tables. T-SQL stored procedures that reference only Hekaton tables can be compiled into machine code for further performance improvements. The engine is designed for high concurrency. To achieve this, it uses only latch-free data structures and a new optimistic, multiversion concurrency control technique. This paper gives an overview of the design of the Hekaton engine and reports some experimental results.

Citations: 476

The big data ecosystem at LinkedIn

Proceedings. ACM-SIGMOD International Conference on Management of Data

Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463707

Roshan Sumbaly, J. Kreps, Sam Shah

The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.

Citations: 140
