
Latest articles from Proceedings. ACM-SIGMOD International Conference on Management of Data

Incremental mapping compilation in an object-to-relational mapping system
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465294
P. Bernstein, Marie Jacob, Jorge Pérez, Guillem Rull, James F. Terwilliger
In an object-to-relational mapping system (ORM), mapping expressions explain how to expose relational data as objects and how to store objects in tables. If mappings are sufficiently expressive, then it is possible to define lossy mappings. If a user updates an object, stores it in the database based on a lossy mapping, and then retrieves the object from the database, the user might get a different result than the updated state of the object; that is, the mapping might not "roundtrip." To avoid this, the ORM should validate that user-defined mappings roundtrip the data. However, this problem is NP-hard, so mapping validation can be very slow for large or complex mappings. We circumvent this problem by developing an incremental compiler for OR mappings. Given a validated mapping, a modification to the object schema is compiled into incremental modifications of the mapping. We define the problem formally, present algorithms to solve it for Microsoft's Entity Framework, and report on an implementation. For some mappings, incremental compilation is over 100 times faster than a full mapping compilation, in one case dropping from 8 hours to 50 seconds.
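The roundtrip property at the heart of this abstract is easy to state concretely. Below is a minimal sketch: a toy object, a deliberately lossy mapping, and a check that store-then-retrieve is the identity. The mapping interface (to_tables/from_tables) is a hypothetical stand-in, not the Entity Framework API, and note that the paper validates roundtripping over all possible instances (the NP-hard part), whereas this sketch merely tests one object.

```python
# A minimal sketch of the "roundtrip" property, assuming a toy mapping
# interface (to_tables/from_tables are hypothetical names, not the
# Entity Framework API).
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    nickname: str  # silently dropped by the lossy mapping below

class LossyMapping:
    """Stores only `name`, so `nickname` is lost on the way to the table."""
    def to_tables(self, obj: Person) -> dict:
        return {"person": {"name": obj.name}}

    def from_tables(self, rows: dict) -> Person:
        return Person(name=rows["person"]["name"], nickname="")

def roundtrips(mapping, obj) -> bool:
    # A mapping roundtrips an object if store-then-retrieve is the identity.
    return mapping.from_tables(mapping.to_tables(obj)) == obj

print(roundtrips(LossyMapping(), Person("Ada", "The Countess")))  # False
```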
Citations: 7
BitWeaving: fast scans for main memory data processing
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465322
Yinan Li, J. Patel
This paper focuses on running scans in a main memory data processing system at "bare metal" speed. Essentially, this means that the system must aim to process data at or near the speed of the processor (the fastest component in most system configurations). Scans are common in main memory data processing environments, and with the state-of-the-art techniques it still takes many cycles per input tuple to apply simple predicates on a single column of a table. In this paper, we propose a technique called BitWeaving that exploits the parallelism available at the bit level in modern processors. BitWeaving operates on multiple bits of data in a single cycle, processing bits from different columns in each cycle. Thus, bits from a batch of tuples are processed in each cycle, allowing BitWeaving to drop the cycles per column to below one in some cases. BitWeaving comes in two flavors: BitWeaving/V, which looks like a columnar organization but at the bit level, and BitWeaving/H, which packs bits horizontally. In this paper we also develop the arithmetic framework that is needed to evaluate predicates using these BitWeaving organizations. Our experimental results show that both these methods produce significant performance benefits over the existing state-of-the-art methods, and in some cases produce over an order of magnitude in performance improvement.
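The bit-level parallelism described here can be illustrated with the classic word-parallel comparison that underlies horizontal bit-packing schemes such as BitWeaving/H. The sketch below is a simplification under assumed parameters (7-bit codes, one guard bit per code), not the paper's exact storage layout: it packs eight column codes into one 64-bit word and evaluates a less-than predicate on all of them with a handful of word-wide operations.

```python
# Word-parallel predicate evaluation in the spirit of BitWeaving/H
# (simplified sketch, not the paper's exact layout): each 7-bit column
# code sits in an 8-bit slot whose top bit is a guard bit, so one 64-bit
# word holds 8 codes and a few word-wide ops compare them all at once.
K = 7                 # bits per column code
SLOT = K + 1          # guard bit + code

def pack(codes):
    """Pack codes (each < 2**K) into one integer word, guard bits zero."""
    word = 0
    for i, c in enumerate(codes):
        word |= c << (i * SLOT)
    return word

def lt_mask(x_word, const, n):
    """Per-code 'code < const': returns a word with one guard bit per hit."""
    H = sum(1 << (i * SLOT + K) for i in range(n))   # all guard bits
    c_word = pack([const] * n)                       # broadcast the constant
    # Per slot, (x | H) - c leaves the guard bit set iff x >= const, and no
    # borrow crosses slots; complementing under H flags the codes < const.
    return ~((x_word | H) - c_word) & H

codes = [5, 99, 42, 7, 100, 0, 64, 13]
mask = lt_mask(pack(codes), 64, len(codes))
print([i for i in range(len(codes)) if mask >> (i * SLOT + K) & 1])
# -> [0, 2, 3, 5, 7], the positions of codes < 64
```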
Citations: 147
Discovering XSD keys from XML data
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463705
M. Arenas, Jonny Daenen, F. Neven, M. Ugarte, J. V. D. Bussche, Stijn Vansummeren
A great deal of research into the learning of schemas from XML data has been conducted in recent years to enable the automatic discovery of XML Schemas from XML documents when no schema, or only a low-quality one, is available. Unfortunately, and in strong contrast to, for instance, the relational model, the automatic discovery of even the simplest of XML constraints, namely XML keys, has been left largely unexplored in this context. A major obstacle here is the unavailability of a theory on reasoning about XML keys in the presence of XML schemas, which is needed to validate the quality of candidate keys. The present paper embarks on a fundamental study of such a theory and classifies the complexity of several crucial properties concerning XML keys in the presence of an XSD, such as testing for consistency, boundedness, satisfiability, universality, and equivalence. Of independent interest, novel results related to the cardinality estimation of XPath result sets are obtained. A mining algorithm is then developed within the framework of levelwise search. The algorithm leverages known discovery algorithms for functional dependencies in the relational model, but incorporates the above-mentioned properties to assess and refine the quality of derived keys. An experimental study on an extensive body of real-world XML data evaluating the effectiveness of the proposed algorithm is provided.
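The levelwise search mentioned in the abstract follows the familiar Apriori pattern: explore candidate sets level by level and prune supersets of keys already found. The skeleton below is a generic sketch of that pattern; is_key is a placeholder for the paper's XSD-aware key test, and the toy usage treats a flattened document as rows.

```python
# Generic levelwise (Apriori-style) mining skeleton: candidates grow one
# field per level, and supersets of keys already found are pruned.
# `is_key` stands in for the paper's XSD-aware quality test.
def levelwise_keys(fields, is_key):
    """Return minimal field sets passing `is_key`, smallest sets first."""
    found = []
    level = [frozenset([f]) for f in fields]
    while level:
        next_level = set()
        for cand in level:
            if any(key <= cand for key in found):
                continue                    # prune: a subset already is a key
            if is_key(cand):
                found.append(cand)          # minimal key discovered
            else:
                next_level |= {cand | {f} for f in fields if f not in cand}
        level = next_level
    return found

# Toy usage: "keys" of a flattened document = field sets with no
# duplicate projections over its rows.
rows = [{"a": 1, "b": "x", "c": 1}, {"a": 1, "b": "y", "c": 2}]
unique = lambda cand: len({tuple(r[f] for f in sorted(cand)) for r in rows}) == len(rows)
print(levelwise_keys(["a", "b", "c"], unique))  # [frozenset({'b'}), frozenset({'c'})]
```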
Citations: 2
Fact checking and analyzing the web
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463692
François Goasdoué, Konstantinos Karanasos, Yannis Katsis, J. Leblay, I. Manolescu, Stamatis Zampetakis
Fact checking and data journalism are currently strong trends. The sheer amount of data at hand makes it difficult even for trained professionals to spot biased, outdated or simply incorrect information. We propose to demonstrate FactMinder, a fact checking and analysis assistance application. SIGMOD attendees will be able to analyze documents using FactMinder and experience how background knowledge and open data repositories help build insightful overviews of current topics.
Citations: 34
Integrating scale out and fault tolerance in stream processing using operator state management
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465282
R. Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, P. Pietzuch
As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.
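The recovery mechanism described here (checkpoint operator state, back it up upstream, and on failure restore the checkpoint and replay unprocessed tuples) can be sketched in a few lines. The classes below are illustrative toys, not the system's actual primitives.

```python
# Toy state management primitives: checkpoint externalised operator
# state, keep it (plus not-yet-covered tuples) upstream, and recover by
# restoring the checkpoint and replaying. Names are illustrative only.
import copy

class CountingOperator:
    """A stateful operator: counts occurrences per key."""
    def __init__(self, state=None):
        self.state = state if state is not None else {}
    def process(self, tup):
        self.state[tup[0]] = self.state.get(tup[0], 0) + 1
    def checkpoint(self):
        return copy.deepcopy(self.state)          # externalised state

class Upstream:
    """Upstream node: buffers sent tuples and the latest checkpoint."""
    def __init__(self):
        self.buffer, self.ckpt = [], {}
    def send(self, op, tup):
        self.buffer.append(tup)                   # kept until checkpointed
        op.process(tup)
    def take_checkpoint(self, op):
        self.ckpt = op.checkpoint()
        self.buffer.clear()                       # tuples now covered
    def recover(self):
        op = CountingOperator(copy.deepcopy(self.ckpt))  # fresh "VM"
        for tup in self.buffer:                   # replay unprocessed tuples
            op.process(tup)
        return op

up, op = Upstream(), CountingOperator()
for t in [("a",), ("b",), ("a",)]:
    up.send(op, t)
up.take_checkpoint(op)
up.send(op, ("b",))
del op                                            # simulate a failure
print(up.recover().state)                         # {'a': 2, 'b': 2}
```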
Citations: 360
Provenance-based dictionary refinement in information extraction
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465284
Sudeepa Roy, Laura Chiticariu, V. Feldman, Frederick Reiss, Huaiyu Zhu
Dictionaries of terms and phrases (e.g. common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removal of a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results. In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results, estimating the missing labels under a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.
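A drastically simplified sketch of the provenance-driven search can make the formulation concrete. Below, each result carries the set of dictionary entries it derives from, removing entries removes the dependent results, and a greedy loop drops the entry whose removal most improves F1 on labeled results. The paper's formulation handles arbitrarily complex provenance and incomplete labels; this sketch assumes fully labeled results and simple provenance sets.

```python
# Greedy provenance-based refinement (simplified): results carry the set
# of dictionary entries they derive from; removing entries removes the
# dependent results; drop the entry whose removal most improves F1.
def refine(results, entries):
    """results: list of (provenance_entry_set, is_correct) labeled outputs."""
    total_good = sum(1 for _, good in results if good)

    def f1(removed):
        kept = [(prov, good) for prov, good in results if not prov & removed]
        tp = sum(1 for _, good in kept if good)
        if not kept or tp == 0:
            return 0.0
        prec, rec = tp / len(kept), tp / total_good
        return 2 * prec * rec / (prec + rec)

    removed = set()
    while True:
        best = max(entries - removed, key=lambda e: f1(removed | {e}), default=None)
        if best is None or f1(removed | {best}) <= f1(removed):
            return removed
        removed.add(best)

# Toy run: "ibm" yields one correct result; "may" yields one correct but
# three incorrect results, so dropping it raises F1 from 0.57 to 0.67.
results = [({"ibm"}, True), ({"may"}, True),
           ({"may"}, False), ({"may"}, False), ({"may"}, False)]
print(refine(results, {"ibm", "may"}))  # {'may'}
```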
Citations: 6
TOUCH: in-memory spatial join by hierarchical data-oriented partitioning
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463700
Sadegh Heyrani-Nobari, F. Tauheed, T. Heinis, Panagiotis Karras, S. Bressan, A. Ailamaki
Efficient spatial joins are pivotal for many applications and particularly important for geographical information systems or for the simulation sciences where scientists work with spatial models. Past research has primarily focused on disk-based spatial joins; efficient in-memory approaches, however, are important for two reasons: a) main memory has grown so large that many datasets fit in it and b) the in-memory join is a very time-consuming part of all disk-based spatial joins. In this paper we develop TOUCH, a novel in-memory spatial join algorithm that uses hierarchical data-oriented space partitioning, thereby keeping both its memory footprint and the number of comparisons low. Our results show that TOUCH outperforms known in-memory spatial-join algorithms as well as in-memory implementations of disk-based join approaches. In particular, it has a one order of magnitude advantage over the memory-demanding state of the art in terms of number of comparisons (i.e., pairwise object comparisons), as well as execution time, while it is two orders of magnitude faster when compared to approaches with a similar memory footprint. Furthermore, TOUCH is more scalable than competing approaches as data density grows.
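The data-oriented partitioning idea can be sketched compactly: build a hierarchy of MBRs over one dataset, sink each object of the other dataset while it overlaps exactly one child MBR, and join it only against the objects below the node where it stops. The code below uses a kd-style split over rectangle centers as an assumed stand-in for the paper's tree construction.

```python
# Hierarchical data-oriented partitioning (simplified sketch): dataset A
# is organised into a tree of MBRs; each B rectangle sinks while it
# overlaps exactly one child MBR and is joined only within that subtree.
def mbr(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def build(rects, depth=0, leaf=2):
    node = {"mbr": mbr(rects), "rects": rects, "children": []}
    if len(rects) > leaf:
        axis = depth % 2                 # alternate x/y splits
        by_center = sorted(rects, key=lambda r: r[axis] + r[axis + 2])
        mid = len(by_center) // 2
        node["children"] = [build(by_center[:mid], depth + 1),
                            build(by_center[mid:], depth + 1)]
    return node

def probe(tree, b):
    node = tree
    while node["children"]:
        hit = [c for c in node["children"] if overlaps(c["mbr"], b)]
        if not hit:
            return []                    # b misses this whole subtree
        if len(hit) > 1:
            break                        # b straddles children: join here
        node = hit[0]                    # sink to the single overlapping child
    return [a for a in node["rects"] if overlaps(a, b)]

A = [(0, 0, 2, 2), (1, 1, 3, 3), (8, 8, 9, 9), (7, 0, 9, 2)]
print(probe(build(A), (1, 1, 2, 2)))     # [(0, 0, 2, 2), (1, 1, 3, 3)]
```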
Citations: 54
DBalancer: distributed load balancing for NoSQL data-stores
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465232
I. Konstantinou, Dimitrios Tsoumakos, Ioannis Mytilinis, N. Koziris
Unanticipated load spikes or skewed data access patterns may lead to severe performance degradation in data serving applications, a typical problem of distributed NoSQL data-stores. In these cases, load balancing is a necessary operation. In this demonstration, we present DBalancer, a generic distributed module that can be installed on top of a typical NoSQL data-store to provide an efficient and highly configurable load balancing mechanism. Balancing is performed by simple message exchanges and typical data movement operations supported by most modern NoSQL data-stores. We present the system's architecture, describe its modules and their interaction in detail, and implement a suite of different algorithms on top of it. Through a web-based interactive GUI we allow users to launch NoSQL clusters of various sizes, to apply numerous skewed and dynamic workloads, and to compare the implemented load balancing algorithms. Videos and graphs showcasing each algorithm's effect on a number of indicative performance and cost metrics will be created on the fly for every setup. By browsing the results of different executions, users will be able to grasp each algorithm's balancing mechanisms and performance impact in a number of representative setups.
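As a flavor of what "simple message exchanges and typical data movement operations" can achieve, here is a textbook neighbor-exchange balancing sketch: in each step, two adjacent nodes split their combined item count. This is a generic illustration, not one of DBalancer's implemented algorithms.

```python
# A textbook neighbor-exchange balancer (generic illustration, not one of
# DBalancer's algorithms): each message exchange makes two adjacent nodes
# split their combined item count evenly, moving items between them.
def balance_step(loads, i, j):
    total = loads[i] + loads[j]
    loads[i], loads[j] = total // 2, total - total // 2

def balance(loads, rounds=10):
    for _ in range(rounds):
        for i in range(len(loads) - 1):   # sweep a line of neighbors
            balance_step(loads, i, i + 1)
    return loads

print(balance([100, 5, 5, 5, 5]))  # converges toward 24 items per node
```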
Citations: 16
PBS at work: advancing data management with consistency metrics
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465260
Peter D. Bailis, S. Venkataraman, M. Franklin, J. Hellerstein, I. Stoica
A large body of recent work has proposed analytical and empirical techniques for quantifying the data consistency properties of distributed data stores. In this demonstration, we begin to explore the wide range of new database functionality they enable, including dynamic query tuning, consistency SLAs, monitoring, and administration. Our demonstration will exhibit how both application programmers and database administrators can leverage these features. We describe three major application scenarios and present a system architecture for supporting them. We also describe our experience in integrating Probabilistically Bounded Staleness (PBS) predictions into Cassandra, a popular NoSQL store, and sketch a demo platform that will allow SIGMOD attendees to experience the importance and applicability of real-time consistency metrics.
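PBS predictions of the kind described here can be approximated with a small Monte Carlo simulation over the four WARS latencies (write propagation, ack, read request, read response). The sketch below assumes exponential latencies purely for illustration and estimates the probability that a read quorum of r replicas misses a write acknowledged by w of n replicas t milliseconds earlier; it is a simplification in the spirit of PBS, not the integrated implementation.

```python
# Monte Carlo staleness prediction in the spirit of PBS's WARS model,
# assuming exponential latencies purely for illustration.
import random

def p_stale(n, w, r, t, trials=50_000, mean_ms=5.0):
    """P(a read of r replicas misses a write acked by w of n replicas,
    when the read starts t ms after the write was acknowledged)."""
    lat = lambda: random.expovariate(1.0 / mean_ms)
    stale = 0
    for _ in range(trials):
        W = [lat() for _ in range(n)]     # write propagation per replica
        A = [lat() for _ in range(n)]     # ack back to the coordinator
        commit = sorted(wi + ai for wi, ai in zip(W, A))[w - 1]
        R = [lat() for _ in range(n)]     # read request per replica
        S = [lat() for _ in range(n)]     # read response per replica
        fastest = sorted(range(n), key=lambda i: R[i] + S[i])[:r]
        # Fresh iff any answering replica saw the write before serving it.
        if not any(W[i] <= commit + t + R[i] for i in fastest):
            stale += 1
    return stale / trials

for t in (0, 5, 10, 20):
    print(f"t={t:>2} ms  P(stale) ~ {p_stale(3, 1, 1, t):.3f}")
```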
Citations: 11
Calibrating trajectory data for similarity-based analysis
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465303
Han Su, Kai Zheng, Haozhou Wang, Jiamin Huang, Xiaofang Zhou
Due to the prevalence of GPS-enabled devices and wireless communications technologies, spatial trajectories that describe the movement history of moving objects are being generated and accumulated at an unprecedented pace. Trajectory data in a database are intrinsically heterogeneous, as they represent discrete approximations of original continuous paths derived using different sampling strategies and different sampling rates. Such heterogeneity can have a negative impact on the effectiveness of trajectory similarity measures, which are the basis of many crucial trajectory processing tasks. In this paper, we pioneer a systematic approach to trajectory calibration, a process that transforms a heterogeneous trajectory dataset into one with (almost) unified sampling strategies. Specifically, we propose an anchor-based calibration system that aligns trajectories to a set of anchor points, which are fixed locations independent of the trajectory data. After examining four different types of anchor points for the purpose of building a stable reference system, we propose a geometry-based calibration approach that considers the spatial relationship between anchor points and trajectories. Then a more advanced model-based calibration method is presented, which exploits the power of machine learning techniques to train inference models from historical trajectory data to improve calibration effectiveness. Finally, we conduct extensive experiments using real trajectory datasets to demonstrate the effectiveness and efficiency of the proposed calibration system.
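The geometry-based flavor of anchor-based calibration can be sketched directly from its description: snap every raw sample to its nearest anchor point and collapse consecutive duplicates, so trajectories sampled at different rates map to comparable anchor sequences. The anchors and trajectories below are made-up toys; note how the sparse trajectory still skips an anchor, which is the kind of gap the paper's model-based method is designed to close.

```python
# Geometry-based anchor calibration (minimal sketch): snap each sample to
# its nearest anchor and collapse consecutive duplicates. Anchors and
# trajectories below are made-up toys.
import math

def calibrate(trajectory, anchors):
    out = []
    for p in trajectory:
        a = min(anchors, key=lambda q: math.dist(p, q))
        if not out or out[-1] != a:       # drop consecutive duplicates
            out.append(a)
    return out

anchors = [(0, 0), (1, 0), (2, 0), (2, 1)]
dense   = [(0.1, 0.1), (0.6, 0.0), (1.1, -0.1), (1.9, 0.1), (2.0, 0.9)]
sparse  = [(0.0, 0.2), (1.2, 0.1), (2.1, 1.0)]
print(calibrate(dense, anchors))   # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(calibrate(sparse, anchors))  # [(0, 0), (1, 0), (2, 1)] -- skips (2, 0)
```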
Citations: 97