SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading

Impact Factor: 2.2 | Zone 2, Computer Science | JCR Q3, COMPUTER SCIENCE, INFORMATION SYSTEMS | ACM Transactions on Database Systems | Pub Date: 2015-10-23 | DOI: 10.1145/2818181

Yu Cheng, Florin Rusu


Citations: 20

Abstract

Traditional databases incur a significant data-to-query delay due to the requirement to load data inside the system before querying. Since this is not acceptable in many domains generating massive amounts of raw data (e.g., genomics), databases are entirely discarded. External tables, on the other hand, provide instant SQL querying over raw files. Their performance across a query workload is limited though by the speed of repeated full scans, tokenizing, and parsing of the entire file. In this article, we propose SCANRAW, a novel database meta-operator for in-situ processing over raw files that integrates data loading and external tables seamlessly, while preserving their advantages: optimal performance across a query workload and zero time-to-query. We decompose loading and external table processing into atomic stages in order to identify common functionality. We analyze alternative implementations and discuss possible optimizations for each stage. Our major contribution is a parallel superscalar pipeline implementation that allows SCANRAW to take advantage of the current many- and multicore processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading by speculatively using the additional I/O bandwidth arising during the conversion process for storing data into the database, such that subsequent queries execute faster. As a result, SCANRAW makes intelligent use of the available system resources—CPU cycles and I/O bandwidth—by switching dynamically between tasks to ensure that optimal performance is achieved. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves the best-possible performance for a query sequence at any point in the processing. 
Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data and without interfering with normal query processing.
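The staged pipeline and speculative loading described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the stage names (`tokenize`, `parse`), the `scan_raw` function, the `io_idle` callback, and the use of a plain dictionary in place of database storage are all assumptions made for illustration, and each stage runs on a single thread rather than the multi-threaded superscalar design the paper proposes.

```python
import queue
import threading

SENTINEL = object()  # marks end of stream between stages


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def tokenize(chunk):
    """Split each raw text line of a chunk into string fields."""
    cid, lines = chunk
    return cid, [line.split(",") for line in lines]


def parse(chunk):
    """Convert string fields into integers (the binary representation here)."""
    cid, rows = chunk
    return cid, [[int(field) for field in row] for row in rows]


def scan_raw(raw_chunks, loaded, io_idle):
    """Yield (chunk_id, rows) for a query, converting only unloaded chunks.

    raw_chunks: dict chunk_id -> list of CSV lines (stands in for the raw file)
    loaded:     dict chunk_id -> parsed rows (stands in for database storage)
    io_idle:    hypothetical callable reporting spare I/O bandwidth
    """
    pending = [cid for cid in raw_chunks if cid not in loaded]

    # Chunks loaded by earlier queries skip tokenizing and parsing entirely.
    for cid in raw_chunks:
        if cid in loaded:
            yield cid, loaded[cid]

    q_read, q_tok, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(tokenize, q_read, q_tok)),
        threading.Thread(target=stage, args=(parse, q_tok, q_out)),
    ]
    for w in workers:
        w.start()

    # READ stage: feed raw chunks into the pipeline, then signal completion.
    for cid in pending:
        q_read.put((cid, raw_chunks[cid]))
    q_read.put(SENTINEL)

    while True:
        item = q_out.get()
        if item is SENTINEL:
            break
        cid, rows = item
        if io_idle():
            loaded[cid] = rows  # speculative load: store only when I/O is spare
        yield cid, rows

    for w in workers:
        w.join()


def total(chunks):
    """A toy query: sum every value in every row of every chunk."""
    return sum(v for _, rows in chunks for row in rows for v in row)


raw = {0: ["1,2", "3,4"], 1: ["5,6"], 2: ["7,8"]}
db = {}

# First query: I/O is assumed idle for chunks 0 and 2 but busy for chunk 1.
idle = iter([True, False, True])
first = total(scan_raw(raw, db, lambda: next(idle)))
loaded_after_first = sorted(db)

# Second query: chunks 0 and 2 come from storage; only chunk 1 is converted.
second = total(scan_raw(raw, db, lambda: True))
loaded_after_second = sorted(db)
```

In this sketch the first query returns its answer without waiting for any load, and whatever was stored along the way shortens the work of the second query, which is the behavior the abstract attributes to speculative loading.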




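Source journal

ACM Transactions on Database Systems (category: Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 5.60

Self-citation rate: 0.00%

Articles published:

Review time: >12 weeks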

Journal description: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.
