
Proceedings. ACM-SIGMOD International Conference on Management of Data: Latest Publications

Adaptive log compression for massive log data
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465341
Robert Christensen, Feifei Li
We present a novel adaptive log compression scheme. Results show a 30% improvement in compression ratio over existing approaches.
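The abstract does not describe the scheme itself; as a minimal, purely illustrative sketch of the general idea of *adaptive* compression (choosing the best codec per block of log data rather than one codec globally), one might write:

```python
import bz2
import zlib

def compress_block(block: bytes) -> tuple[str, bytes]:
    """Adaptively pick whichever codec compresses this block best.

    Illustrative only: the paper's actual scheme is not reproduced here;
    this just shows per-block codec selection, the essence of adaptivity.
    """
    candidates = {
        "zlib": zlib.compress(block, 9),
        "bz2": bz2.compress(block),
    }
    # Keep the codec that produced the smallest output for this block.
    name, data = min(candidates.items(), key=lambda kv: len(kv[1]))
    return name, data

block = b"2013-06-22 INFO request ok\n" * 200
codec, payload = compress_block(block)
assert len(payload) < len(block)  # highly repetitive log data compresses well
```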
Citations: 13
LinkBench: a database benchmark based on the Facebook social graph
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465296
Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, Mark D. Callaghan
Database benchmarks are an important tool for database researchers and practitioners that ease the process of making informed comparisons between different database hardware, software and configurations. Large scale web services such as social networks are a major and growing database application area, but currently there are few benchmarks that accurately model web service workloads. In this paper we present a new synthetic benchmark called LinkBench. LinkBench is based on traces from production databases that store "social graph" data at Facebook, a major social network. We characterize the data and query workload in many dimensions, and use the insights gained to construct a realistic synthetic benchmark. LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.
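A synthetic benchmark of this kind is driven by an operation mix derived from the production trace. The operation names and probabilities below are hypothetical stand-ins, loosely inspired by the social-graph operations LinkBench models; the real benchmark's mix and data generators come from Facebook's measured workload:

```python
import random

# Hypothetical operation mix (names and weights are illustrative, not
# LinkBench's actual configuration).
OP_MIX = [
    ("get_link_list", 0.50),
    ("count_links", 0.20),
    ("add_link", 0.10),
    ("update_node", 0.10),
    ("delete_link", 0.10),
]

def next_op(rng: random.Random) -> str:
    """Draw the next graph operation according to the configured mix."""
    r = rng.random()
    cum = 0.0
    for op, p in OP_MIX:
        cum += p
        if r < cum:
            return op
    return OP_MIX[-1][0]  # guard against floating-point round-off

rng = random.Random(42)
ops = [next_op(rng) for _ in range(10_000)]
# Reads dominate, mirroring the read-heavy skew typical of social-graph
# workloads.
```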
Citations: 304
ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465316
K. Whang, Tae-Seob Yun, Yeon-Mi Yeo, I. Song, Hyuk-Yoon Kwon, In-Joong Kim
Recently, parallel search engines have been implemented based on scalable distributed file systems such as Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. Regarding higher-level functionality, we can draw a parallel with the traditional O/S file system vs. DBMS. In this paper, we propose a new approach of building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS. To estimate the performance, we propose a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We argue that the model can accurately estimate the performance of a massively-parallel (e.g., 300-node) search engine using the experimental results obtained from a small-scale (e.g., 5-node) one. We show that the estimation error between the model and the actual experiment is less than 2.13% by observing that the bulk of the query processing time is spent at the slave (vs. at the master and network) and by estimating the time spent at the slave based on actual measurement. Using our model, we demonstrate a commercial-level scalability and performance of our architecture. Our proposed system ODYS is capable of handling 1 billion queries per day (81 queries/sec) for 30 billion Web pages by using only 43,472 nodes with an average query response time of 194 ms. By using twice as many (86,944) nodes, ODYS can provide an average query response time of 148 ms. These results show that building a massively-parallel search engine using a parallel DBMS is a viable approach with advantages of supporting the high-level (i.e., DBMS-level), SQL-like programming interface.
Citations: 6
Improving regular-expression matching on strings using negative factors
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465289
Xiaochun Yang, Bin Wang, Tao Qiu, Yaoshu Wang, Chen Li
The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes candidates by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular GNU Grep tool by 11 to 74 times for text documents.
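A toy illustration of the filtering idea (the paper's algorithms for deriving factors from the RE and its bit-parallel processing are not reproduced): a true match must contain every necessary factor and can never contain a negative factor, so candidates failing either test are discarded before the expensive automaton runs. All names here are illustrative.

```python
import re

def match_with_factors(candidates, necessary, negative, verify):
    """Filter candidate substrings before automaton verification.

    necessary: a substring every answer must contain.
    negative:  a substring no answer can contain.
    verify:    the (expensive) automaton check, run only on survivors.
    """
    hits = []
    for cand in candidates:
        if necessary not in cand:
            continue  # necessary-factor filter
        if negative in cand:
            continue  # negative-factor pruning: cannot be an answer
        if verify(cand):
            hits.append(cand)
    return hits

# For the RE ab[0-9]+z, "ab" is a necessary factor and "x" is a negative
# factor (no match can contain a letter between the digits).
verify = lambda s: re.fullmatch(r"ab[0-9]+z", s) is not None
hits = match_with_factors(["ab12z", "ab1x2z", "abz", "cd9z"], "ab", "x", verify)
# "ab1x2z" is pruned without ever running the automaton; hits == ["ab12z"]
```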
Citations: 10
Less watts, more performance: an intelligent storage engine for data appliances
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463685
L. Woods, J. Teubner, G. Alonso
In this demonstration, we present Ibex, a novel storage engine featuring hybrid, FPGA-accelerated query processing. In Ibex, an FPGA is inserted along the path between the storage devices and the database engine. The FPGA acts as an intelligent storage engine supporting query off-loading from the query engine. Apart from significant performance improvements for many common SQL queries, the demo will show how Ibex reduces data movement, CPU usage, and overall energy consumption in database appliances.
Citations: 35
Latch-free data structures for DBMS: design, implementation, and evaluation
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463720
T. Horikawa
The fact that multi-core CPUs have become so common and that the number of CPU cores in one chip has continued to rise means that a server machine can easily contain an extremely high number of CPU cores. The CPU scalability of IT systems is thus attracting a considerable amount of research attention. Some systems, such as ACID-compliant DBMSs, are said to be difficult to scale, probably due to the mutual exclusion required to ensure data consistency. Possible countermeasures include latch-free (LF) data structures, an elemental technology to improve the CPU scalability by eliminating the need for mutual exclusion. This paper investigates these LF data structures with a particular focus on their applicability and effectiveness. Some existing LF data structures (such as LF hash tables) have been adapted to PostgreSQL, one of the most popular open-source DBMSs. The performance improvement was evaluated with a benchmark program simulating real-world transactions. Measurement results obtained from state-of-the-art 80-core machines demonstrated that the LF data structures were effective for performance improvement in a many-core situation in which DBT-1 throughput increased by about 2.5 times. Although the poor performance of the original DBMS was due to a severe latch-related bottleneck and can be improved by parameter tuning, it is of practical importance that LF data structures provided performance improvement without deep understanding of the target system behavior that is necessary for the parameter tuning.
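The latch-free pattern the abstract refers to replaces "hold a latch for the whole operation" with "retry a compare-and-swap (CAS) on a single word". A minimal sketch of a classic example, the Treiber stack, is below. Python has no hardware CAS primitive, so the `AtomicRef` cell is simulated with a lock purely to make the retry-loop pattern visible; this is not the paper's PostgreSQL implementation.

```python
import threading

class AtomicRef:
    """Toy CAS cell (lock-simulated; real latch-free code uses hardware CAS)."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class Node:
    __slots__ = ("item", "next")
    def __init__(self, item, nxt):
        self.item, self.next = item, nxt

class TreiberStack:
    """Latch-free stack: on contention, a CAS fails and the loop retries,
    instead of any thread blocking on a latch."""
    def __init__(self):
        self.head = AtomicRef(None)

    def push(self, item):
        while True:
            old = self.head.get()
            if self.head.compare_and_swap(old, Node(item, old)):
                return

    def pop(self):
        while True:
            old = self.head.get()
            if old is None:
                return None
            if self.head.compare_and_swap(old, old.next):
                return old.item
```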
Citations: 27
DeltaNI: an efficient labeling scheme for versioned hierarchical data
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465329
Jan Finis, Robert Brunel, A. Kemper, Thomas Neumann, Franz Färber, Norman May
Main-memory database systems are emerging as the new backbone of business applications. Besides flat relational data representations, hierarchical ones are also essential for these modern applications; therefore we devise a new indexing and versioning approach for hierarchies that is deeply integrated into the relational kernel. We propose the DeltaNI index as a versioned pendant of the nested intervals (NI) labeling scheme. The index is space- and time-efficient and yields a gapless, fixed-size integer NI labeling for each version while also supporting branching histories. In contrast to a naive NI labeling, it facilitates even complex updates of the tree structure. As many query processing techniques that work on top of the NI labeling have already been proposed, our index can be used as a building block for processing various kinds of queries. We evaluate the performance of the index on large inputs consisting of millions of nodes and thousands of versions. Thereby we show that DeltaNI scales well and can deliver satisfying performance for large business scenarios.
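A minimal sketch of the underlying (non-versioned) nested-intervals labeling that DeltaNI builds on: a DFS assigns each node an interval, and ancestry reduces to strict interval containment. DeltaNI's versioned deltas and branching histories are not reproduced here.

```python
def label(tree, root):
    """Assign nested-interval labels via DFS; `tree` maps node -> children."""
    labels, counter = {}, 0

    def visit(node):
        nonlocal counter
        lower = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        labels[node] = (lower, counter)  # interval encloses all descendants
        counter += 1

    visit(root)
    return labels

def is_ancestor(labels, a, b):
    """a is an ancestor of b iff a's interval strictly contains b's."""
    (al, au), (bl, bu) = labels[a], labels[b]
    return al < bl and bu < au

tree = {"A": ["B", "C"], "B": ["D"]}
labels = label(tree, "A")
assert is_ancestor(labels, "A", "D")       # A encloses its grandchild D
assert not is_ancestor(labels, "C", "D")   # siblings' subtrees are disjoint
```

The appeal of this encoding, and the reason updates are hard without a delta scheme like DeltaNI's, is that the ancestor test is two integer comparisons but a subtree insertion shifts every label to its right.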
Citations: 15
Execution and optimization of continuous queries with cyclops
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465248
Harold Lim, S. Babu
As the data collected by enterprises grows in scale, there is a growing trend of performing data analytics on large datasets. Batch processing systems that can handle petabyte scale of data, such as Hadoop, have flourished and gained traction in the industry. As the results of batch analytics have been used to continuously improve front-facing user experience, there is a growing interest in pushing the processing latency down. This trend has fueled a resurgence in the development and usage of execution engines that can process continuous queries. An important class of continuous queries is windowed aggregation queries. Such queries arise in a wide range of applications such as generating personalized content and results. Today, considerable manual effort goes into finding the most suitable execution engine for these queries and on tuning query performance on these engines. An ecosystem composed of multiple execution engines may be needed in order to run the overall query workload efficiently given the diverse set of requirements that arise in practice. Cyclops is a continuous query processing platform that manages and orchestrates windowed aggregation queries in an ecosystem composed of multiple continuous query execution engines. Cyclops employs a cost-based approach for picking the most suitable engine and plan for executing a given query. This demonstration first presents an interactive visualization of the rich execution plan space of windowed aggregation queries, which allows users to analyze and understand the differences among plans. The next part of the demonstration will drill down into the design of Cyclops. For a given query, we show the cost spectrum of query execution plans across three different execution engines---Esper, Storm, and Hadoop---as estimated by Cyclops.
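The query class Cyclops targets, windowed aggregation, can be illustrated with a minimal tumbling-window sum over a timestamped stream. This is illustrative only: Cyclops itself picks an engine (Esper, Storm, or Hadoop) and plan via a cost model rather than executing queries directly.

```python
from collections import defaultdict

def tumbling_window_sum(events, width):
    """events: iterable of (timestamp, key, value) tuples.
    Returns {(window_start, key): sum of values in that window}."""
    out = defaultdict(float)
    for ts, key, val in events:
        window_start = (ts // width) * width  # align to tumbling boundary
        out[(window_start, key)] += val
    return dict(out)

events = [(1, "a", 10.0), (4, "a", 5.0), (6, "b", 1.0), (11, "a", 2.0)]
result = tumbling_window_sum(events, width=5)
# Windows of width 5: [0,5) holds both "a" events, [5,10) the "b" event,
# [10,15) the final "a" event.
```

The engine-choice question Cyclops addresses is visible even in this toy: the same aggregation can run incrementally per event (low latency, as in Esper/Storm) or as a batch over a stored log (high throughput, as in Hadoop).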
Citations: 13
Column imprints: a secondary index structure
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465306
Lefteris Sidirourgos, M. Kersten
Large scale data warehouses rely heavily on secondary indexes, such as bitmaps and b-trees, to limit access to slow IO devices. However, with the advent of large main memory systems, cache conscious secondary indexes are needed to also improve the transfer bandwidth between memory and CPU. In this paper, we introduce column imprint, a simple but efficient cache conscious secondary index. A column imprint is a collection of many small bit vectors, each indexing the data points of a single cacheline. An imprint is used during query evaluation to limit data access and thus minimize memory traffic. The compression for imprints is CPU friendly and exploits the empirical observation that data often exhibits local clustering or partial ordering as a side-effect of the construction process. Most importantly, column imprint compression remains effective and robust even in the case of unclustered data, while other state-of-the-art solutions fail. We conducted an extensive experimental study to assess the applicability and the performance impact of the column imprints. The storage overhead, when experimenting with real world datasets, is just a few percent over the size of the columns being indexed. The evaluation time for over 40000 range queries of varying selectivity revealed the efficiency of the proposed index compared to zonemaps and bitmaps with WAH compression.
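A minimal sketch of the imprint idea: one small bit vector per cacheline of column data, with a bit set for every value-range bin occurring in that cacheline; a range query then scans only cachelines whose imprint intersects the query's bin mask. The cacheline size, bin count, and equi-width binning below are illustrative simplifications, not the paper's construction (which, among other things, compresses the imprint vectors).

```python
CACHELINE = 8  # values per "cacheline" (toy size)
BINS = 8       # bits per imprint

def build_imprints(column, lo, hi):
    """Return one bit vector per cacheline plus the binning function."""
    span = (hi - lo) / BINS
    bin_of = lambda v: min(BINS - 1, int((v - lo) / span))
    imprints = []
    for i in range(0, len(column), CACHELINE):
        bits = 0
        for v in column[i:i + CACHELINE]:
            bits |= 1 << bin_of(v)  # mark every bin present in this line
        imprints.append(bits)
    return imprints, bin_of

def range_scan(column, imprints, bin_of, q_lo, q_hi):
    """Evaluate a range query, skipping cachelines whose imprint misses."""
    mask = 0
    for b in range(bin_of(q_lo), bin_of(q_hi) + 1):
        mask |= 1 << b
    hits = []
    for i, bits in enumerate(imprints):
        if bits & mask:  # imprint intersects the query: touch this line
            base = i * CACHELINE
            hits += [v for v in column[base:base + CACHELINE]
                     if q_lo <= v <= q_hi]  # verify the survivors
    return hits

column = list(range(64))
imprints, bin_of = build_imprints(column, 0, 64)
assert range_scan(column, imprints, bin_of, 10, 13) == [10, 11, 12, 13]
```

On this sorted toy column only one of the eight cachelines is touched; the paper's point is that the per-cacheline granularity keeps such skipping effective even when clustering is only local.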
Citations: 80
Building an efficient RDF store over a relational database
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463718
Mihaela A. Bornea, Julian T Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, O. Udrea, Bishwaranjan Bhattacharjee
Efficient storage and querying of RDF data is of increasing importance, due to the growing popularity and widespread acceptance of RDF on the web and in the enterprise. In this paper, we describe a novel storage and query mechanism for RDF that works on top of existing relational representations. Relying on relational representations of RDF means that one can take advantage of 35+ years of research on efficient storage and querying, industrial-strength transaction support, locking, security, and more. However, there are significant challenges in storing RDF relationally, including data sparsity and schema variability. We describe novel mechanisms to shred RDF into relational form, and novel query translation techniques that maximize the advantages of this shredded representation. We show that these mechanisms deliver consistently good performance across multiple RDF benchmarks, even when compared with current state-of-the-art stores. This work provides the basis for RDF support in DB2 v10.1.
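For illustration only: the simplest relational representation of RDF is a single triple table, with each graph pattern in a query becoming a self-join. The paper's DB2 scheme shreds RDF into a more sophisticated entity-oriented layout precisely to avoid the long join chains this naive layout produces, so the sketch below (schema and data invented) only conveys why a relational engine can answer RDF queries at all:

```python
# Naive triple-table storage of RDF in SQLite, with a two-pattern
# SPARQL-like query ("?x knows ?y . ?y knows ?z") as a self-join.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("alice", "knows", "bob"),
    ("bob",   "knows", "carol"),
    ("alice", "age",   "34"),
])

# One join per additional triple pattern: find friends-of-friends.
rows = conn.execute("""
    SELECT t1.subj, t2.obj
    FROM triples t1 JOIN triples t2
      ON t1.obj = t2.subj
    WHERE t1.pred = 'knows' AND t2.pred = 'knows'
""").fetchall()
print(rows)  # [('alice', 'carol')]
```

Every extra pattern adds another self-join on this table, which is what motivates shredded, entity-oriented layouts like the one the paper describes.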
Citations: 233