
Proceedings of the 2018 International Conference on Management of Data: Latest Publications

Lightweight Cardinality Estimation in LSM-based Systems
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183761
Ildar Absalyamov, M. Carey, V. Tsotras
Data sources, such as social media, mobile apps and IoT sensors, generate billions of records each day. Keeping up with this influx of data while providing useful analytics to the users is a major challenge for today's data-intensive systems. A popular solution that allows such systems to handle rapidly incoming data is to rely on log-structured merge (LSM) storage models. LSM-based systems provide a tunable trade-off between ingesting vast amounts of data at a high rate and running efficient analytical queries on top of that data. For queries, it is well-known that the query processing performance largely depends on the ability to generate efficient execution plans. Previous research showed that OLAP query workloads rely on having small, yet precise, statistical summaries of the underlying data, which can drive the cost-based query optimization. In this paper we address the problem of computing data statistics for workloads with rapid data ingestion and propose a lightweight statistics-collection framework that exploits the properties of LSM storage. Our approach is designed to piggyback on the events (flush and merge) of the LSM lifecycle. This allows us to easily create initial statistics and then keep them in sync with rapidly changing data while minimizing the overhead to the existing system. We have implemented and adapted well-known algorithms to produce various types of statistical synopses, including equi-width histograms, equi-height histograms, and wavelets. We performed an in-depth empirical evaluation that considers both the cardinality estimation accuracy and runtime overheads of collecting and using statistics. The experiments were conducted by prototyping our approach on top of Apache AsterixDB, an open source Big Data management system that has an entirely LSM-based storage backend.
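As a rough illustration of the piggybacking idea, the sketch below updates an equi-width histogram while a flush writes the in-memory component to disk, so statistics collection adds only constant work per record. All names here are invented for illustration, not AsterixDB's actual API.

```python
# Sketch: piggybacking an equi-width histogram on an LSM flush
# (illustrative only; not AsterixDB internals).

class EquiWidthHistogram:
    def __init__(self, lo, hi, num_buckets):
        self.lo, self.hi = lo, hi
        self.width = (hi - lo) / num_buckets
        self.counts = [0] * num_buckets

    def add(self, value):
        # Clamp into range, then bump the bucket the value falls in.
        idx = int((value - self.lo) / self.width)
        idx = min(max(idx, 0), len(self.counts) - 1)
        self.counts[idx] += 1

    def estimate(self, lo, hi):
        # Cardinality estimate for a range predicate, assuming a
        # uniform distribution inside each bucket.
        total = 0.0
        for i, c in enumerate(self.counts):
            b_lo = self.lo + i * self.width
            b_hi = b_lo + self.width
            overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
            total += c * (overlap / self.width)
        return total

def flush(mem_component, disk, hist):
    # The flush already scans every record once; the histogram
    # update adds only O(1) extra work per record.
    for key, value in sorted(mem_component.items()):
        hist.add(key)
        disk.append((key, value))

mem = {k: f"rec{k}" for k in range(100)}
disk, hist = [], EquiWidthHistogram(0, 100, 10)
flush(mem, disk, hist)
```

A merge event would analogously rebuild the synopsis for the merged component while its records stream by, which is how the statistics stay in sync without a separate scan.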
Citations: 15
Persistent Bloom Filter: Membership Testing for the Entire History
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183737
Yanqing Peng, Jinwei Guo, Feifei Li, Weining Qian, Aoying Zhou
Membership testing is the problem of testing whether an element is in a set of elements. Performing the test exactly is expensive space-wise, requiring the storage of all elements in a set. In many applications, an approximate testing that can be done quickly using small space is often desired. Bloom filter (BF) was designed and has witnessed great success across numerous application domains. But there is no compact structure that supports set membership testing for temporal queries, e.g., has person A visited a web server between 9:30am and 9:40am? And has the same person visited the web server again between 9:45am and 9:50am? It is possible to support such "temporal membership testing" using a BF, but we will show that this is fairly expensive. To that end, this paper designs persistent bloom filter (PBF), a novel data structure for temporal membership testing with compact space.
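To see why plain BFs make temporal membership testing expensive, consider the naive baseline the abstract alludes to: keep one Bloom filter per time window, so a range query must probe every window the range covers. A minimal Python sketch of that baseline (illustrative only; this is not the PBF structure itself):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

class NaiveTemporalBF:
    """One Bloom filter per time window; a range query must probe
    every window it covers -- the cost PBF is designed to avoid."""
    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.filters = {}

    def add(self, item, ts):
        w = ts // self.window
        self.filters.setdefault(w, BloomFilter()).add(item)

    def query(self, item, t1, t2):
        # O(number of windows in [t1, t2]) filter probes.
        return any(
            w in self.filters and self.filters[w].might_contain(item)
            for w in range(t1 // self.window, t2 // self.window + 1)
        )

tbf = NaiveTemporalBF()
tbf.add("personA", ts=34_200)                    # 9:30am, in seconds
assert tbf.query("personA", 34_200, 34_800)      # 9:30-9:40: yes
assert not tbf.query("personA", 35_100, 35_400)  # 9:45-9:50: no
```

Both the space (one filter per window) and the query cost (one probe per covered window) grow with the length of the history, which is the overhead a compact temporal structure needs to eliminate.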
Citations: 33
Columnstore and B+ tree - Are Hybrid Physical Designs Important?
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3190660
Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, M. Syamala
Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied --- a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.
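The trade-off the micro-benchmarks quantify can be shown in miniature: a sorted key index (a stand-in for a B+ tree) serves point lookups, while a per-attribute column layout serves scans that touch few attributes. The sketch below is illustrative only, not SQL Server internals:

```python
import bisect

# Toy illustration: the same table kept in a row layout with a
# sorted key index (B+ tree stand-in) and in a column layout, each
# serving the access pattern it is good at.

rows = [(i, f"cust{i}", i * 10) for i in range(1000)]  # (id, name, amount)
keys = [r[0] for r in rows]                            # sorted index on id

# OLTP-style point lookup: binary search on the key index.
def lookup(key):
    i = bisect.bisect_left(keys, key)
    return rows[i] if i < len(rows) and keys[i] == key else None

# Column layout: one contiguous array per attribute.
columns = {
    "id": [r[0] for r in rows],
    "amount": [r[2] for r in rows],
}

# Analytic scan touches only the 'amount' column, not whole rows.
def total_amount():
    return sum(columns["amount"])

assert lookup(42) == (42, "cust42", 420)
assert total_amount() == sum(i * 10 for i in range(1000))
```

A workload that mixes both access patterns on the same table is exactly the case where neither layout alone wins, which is the motivation for recommending hybrid designs.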
Citations: 14
Session details: Keynote 2
Xinyue Dong
Citations: 0
P-Store: An Elastic Database System with Predictive Provisioning
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3190650
Rebecca Taft, Nosayba El-Sayed, M. Serafini, Yu Lu, Ashraf Aboulnaga, M. Stonebraker, Ricardo Mayerhofer, Francisco Jose Andrade
OLTP database systems are a critical part of the operation of many enterprises. Such systems are often configured statically with sufficient capacity for peak load. For many OLTP applications, however, the maximum load is an order of magnitude larger than the minimum, and load varies in a repeating daily pattern. It is thus prudent to allocate computing resources dynamically to match demand. One can allocate resources reactively after a load increase is detected, but this places additional burden on the already-overloaded system to reconfigure. A predictive allocation, in advance of load increases, is clearly preferable. We present P-Store, the first elastic OLTP DBMS to use prediction, and apply it to the workload of B2W Digital (B2W), a large online retailer. Our study shows that P-Store outperforms a reactive system on B2W's workload by causing 72% fewer latency violations, and achieves performance comparable to static allocation for peak demand while using 50% fewer servers.
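A toy sketch of the reactive-versus-predictive distinction (invented numbers and a deliberately trivial forecast, not P-Store's algorithm): with a load curve that repeats daily, a predictive policy can provision ahead of the ramp, while a reactive one reconfigures only after overload is detected:

```python
# Toy reactive vs. predictive provisioning on a repeating daily
# load pattern (illustrative only; not P-Store's actual algorithm).

CAP = 100          # per-server capacity, transactions/sec
LEAD = 2           # hours needed to bring a server online

def daily_load(hour):
    # Repeating pattern: quiet at night, 10x peak during the day.
    return 50 if hour % 24 < 8 else 500

def reactive(hours):
    servers, plan = 1, []
    for h in hours:
        load = daily_load(h)
        if load > servers * CAP:        # react only after overload
            servers = -(-load // CAP)   # ceil division
        plan.append(servers)
    return plan

def predictive(hours):
    # Forecast = yesterday's load LEAD hours ahead (curve repeats).
    plan = []
    for h in hours:
        forecast = daily_load(h + LEAD)
        plan.append(max(1, -(-forecast // CAP)))
    return plan

hours = range(24)
r, p = reactive(hours), predictive(hours)
# The predictive plan already has 5 servers up before the 8am ramp;
# the reactive plan is still at 1 server when the ramp hits.
assert p[6] == 5 and r[7] == 1 and r[8] == 5
```

In the reactive plan, the hour-8 reconfiguration happens while the system is already overloaded, which is exactly when migration work is most harmful; the predictive plan pays that cost during the quiet period instead.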
Citations: 43
Deep Learning for Entity Matching: A Design Space Exploration
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196926
Sidharth Mudgal, Han Li, Theodoros Rekatsinas, A. Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, V. Raghavendra
Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking, textual entailment, etc.). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.
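As a flavor of the lowest-capacity point in that design space, the sketch below mimics SIF-style matching with inverse-frequency-weighted bags of words and cosine similarity. This is a simplification: the paper's SIF model averages pretrained word embeddings, while plain word counts keep the example self-contained. The corpus and threshold are invented.

```python
import math
from collections import Counter

# Simplified stand-in for SIF-style entity matching: weight each
# word by inverse frequency (rare words matter more, echoing SIF's
# a / (a + p(w)) weighting) and compare mentions by cosine.

corpus = [
    "apple iphone 7 32gb silver",
    "apple iphone 7 smartphone 32 gb silver",
    "samsung galaxy s8 64gb black",
]
freq = Counter(w for doc in corpus for w in doc.split())

def vectorize(text):
    return {w: 1.0 / freq[w] for w in text.split() if w in freq}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match(a, b, threshold=0.3):   # threshold is illustrative
    return cosine(vectorize(a), vectorize(b)) >= threshold

assert match(corpus[0], corpus[1])        # same product, noisy text
assert not match(corpus[0], corpus[2])    # different product
```

Token-overlap models like this are exactly what breaks down on "dirty" and free-text instances, where the learned representations in the RNN, Attention, and Hybrid solutions pull ahead.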
Citations: 427
How to Architect a Query Compiler, Revisited
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196893
Ruby Y. Tahboub, Grégory M. Essertel, Tiark Rompf
To leverage modern hardware platforms to their fullest, more and more database systems embrace compilation of query plans to native code. In the research community, there is an ongoing debate about the best way to architect such query compilers. This is perceived to be a difficult task, requiring techniques fundamentally different from traditional interpreted query execution. We aim to contribute to this discussion by drawing attention to an old but underappreciated idea known as Futamura projections, which fundamentally link interpreters and compilers. Guided by this idea, we demonstrate that efficient query compilation can actually be very simple, using techniques that are no more difficult than writing a query interpreter in a high-level language. Moreover, we demonstrate how intricate compilation patterns that were previously used to justify multiple compiler passes can be realized in one single, straightforward, generation pass. Key examples are injection of specialized index structures, data representation changes such as string dictionaries, and various kinds of code motion to reduce the amount of work on the critical path. We present LB2: a high-level query compiler developed in this style that performs on par with, and sometimes beats, the best compiled query engines on the standard TPC-H benchmark.
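The core intuition can be shown in a few lines: specializing a query interpreter to one fixed plan yields compiled code for that plan, which is the essence of the first Futamura projection. A hedged Python sketch, with an invented two-comparison predicate plan (LB2 itself uses staged Scala, not string generation):

```python
# Interpreter vs. plan-specialized "compiled" predicate.

plan = [("age", ">", 30), ("city", "=", "Paris")]

# Interpreter: walks the plan again for every single row.
def interpret(plan, row):
    for col, op, val in plan:
        x = row[col]
        if op == ">" and not x > val:
            return False
        if op == "=" and not x == val:
            return False
    return True

# "Compiler": walk the plan ONCE and emit straight-line Python.
def compile_plan(plan):
    ops = {">": ">", "=": "=="}
    conds = " and ".join(
        f"row[{col!r}] {ops[op]} {val!r}" for col, op, val in plan
    )
    src = f"def predicate(row):\n    return {conds}\n"
    env = {}
    exec(src, env)      # generation pass; no plan walk at runtime
    return env["predicate"]

rows = [{"age": 35, "city": "Paris"}, {"age": 25, "city": "Paris"}]
predicate = compile_plan(plan)
assert [interpret(plan, r) for r in rows] == [predicate(r) for r in rows]
```

The generator is barely harder to write than the interpreter, which is the paper's point: one straightforward generation pass, rather than a multi-pass compiler architecture, can remove all interpretive overhead from the per-row path.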
Citations: 63
Joins over UNION ALL Queries in Teradata®: Demonstration of Optimized Execution
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3193565
Mohammed Al-Kateb, Paul Sinclair, G. Au, Sanjay Nair, Mark Sirek, Lu Ma, M. Eltabakh
The UNION ALL set operator is useful for combining data from multiple sources. With the emergence and prevalence of big data ecosystems in which data is typically stored on multiple systems, UNION ALL has become even more important in many analytical queries. In this project, we demonstrate novel cost-based optimization techniques implemented in Teradata Database for join queries involving UNION ALL views and derived tables. Instead of the naive and traditional way of spooling each UNION ALL branch to a common spool prior to performing join operations, which can be prohibitively expensive, we demonstrate new techniques developed in Teradata Database including: 1) Cost-based pushing of joins into UNION ALL branches, 2) Branch grouping strategy prior to join pushing, 3) Geography adjustment of the pushed relations to avoid unnecessary redistribution or duplication, 4) Iterative join decomposition of a pushed join to multiple joins, and 5) Combining multiple join steps into a single multisource join step. In the demonstration, we use the Teradata Visual Explain tool, which offers a rich set of visual rendering capabilities of query plans, the display of various metadata information for each plan step, and several interactive GUI options for end-users.
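The algebraic rewrite underlying join pushing is the distributivity of join over UNION ALL: (A UNION ALL B) JOIN C equals (A JOIN C) UNION ALL (B JOIN C). A toy relational sketch of the equivalence (not Teradata internals):

```python
# Join pushdown into UNION ALL branches, shown on toy relations.

def join(left, right, key):
    # Simple hash join on a single key column.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [
        {**l, **r}
        for l in left
        for r in index.get(l[key], [])
    ]

A = [{"id": 1, "src": "A"}]
B = [{"id": 2, "src": "B"}]
C = [{"id": 1, "val": 10}, {"id": 2, "val": 20}]

# Naive plan: spool the union first, then join the combined spool.
naive = join(A + B, C, "id")

# Pushed plan: join each branch separately, then union the results.
pushed = join(A, C, "id") + join(B, C, "id")

assert sorted(naive, key=lambda r: r["id"]) == \
       sorted(pushed, key=lambda r: r["id"])
```

The pushed form is what makes the cost-based choices in the list above possible: each branch join can get its own method, geography, and ordering instead of inheriting whatever the combined spool forces.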
Citations: 0
VALMOD: A Suite for Easy and Exact Detection of Variable Length Motifs in Data Series
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3193556
Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn J. Keogh
Data series motif discovery represents one of the most useful primitives for data series mining, with applications to many domains, such as robotics, entomology, seismology, medicine, and climatology. The state-of-the-art motif discovery tools still require the user to provide the motif length. Yet, in several cases, the choice of motif length is critical for their detection. Unfortunately, the obvious brute-force solution, which tests all lengths within a given range, is computationally untenable, and does not provide any support for ranking motifs at different resolutions (i.e., lengths). We demonstrate VALMOD, our scalable motif discovery algorithm that efficiently finds all motifs in a given range of lengths, and outputs a length-invariant ranking of motifs. Furthermore, we support the analysis process by means of a newly proposed meta-data structure that helps the user to select the most promising pattern length.
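For contrast, the brute-force baseline mentioned above is easy to state: for every length in the range, compare all pairs of z-normalized subsequences, roughly O(|range| * n^2 * L) work, which is what makes it untenable at scale. A small Python sketch of that baseline (not the VALMOD algorithm):

```python
import math

# Brute-force variable-length motif search: the baseline VALMOD
# is designed to beat.

def znorm(s):
    m = sum(s) / len(s)
    sd = math.sqrt(sum((x - m) ** 2 for x in s) / len(s)) or 1.0
    return [(x - m) / sd for x in s]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_motif(series, length):
    # Closest non-overlapping pair of subsequences of this length.
    n = len(series)
    best = (float("inf"), None)
    for i in range(n - length + 1):
        for j in range(i + length, n - length + 1):
            d = dist(znorm(series[i:i+length]), znorm(series[j:j+length]))
            best = min(best, (d, (i, j)))
    return best

series = [0, 1, 2, 1, 0, 5, 0, 1, 2, 1, 0, 7]
# Rank the best motif of every length in the range [3, 5].
ranking = sorted(best_motif(series, L) + (L,) for L in range(3, 6))
d, (i, j), L = ranking[0]
assert d == 0.0 and (i, j) == (0, 6)   # the repeated [0,1,2,...] bump
```

Note that the naive per-length distances are not directly comparable across lengths (longer subsequences accumulate larger distances), which is why a length-invariant ranking is a contribution in its own right.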
{"title":"VALMOD: A Suite for Easy and Exact Detection of Variable Length Motifs in Data Series","authors":"Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn J. Keogh","doi":"10.1145/3183713.3193556","DOIUrl":"https://doi.org/10.1145/3183713.3193556","url":null,"abstract":"Data series motif discovery represents one of the most useful primitives for data series mining, with applications to many domains, such as robotics, entomology, seismology, medicine, and climatology, and others. The state-of-the-art motif discovery tools still require the user to provide the motif length. Yet, in several cases, the choice of motif length is critical for their detection. Unfortunately, the obvious brute-force solution, which tests all lengths within a given range, is computationally untenable, and does not provide any support for ranking motifs at different resolutions (i.e., lengths). We demonstrate VALMOD, our scalable motif discovery algorithm that efficiently finds all motifs in a given range of lengths, and outputs a length-invariant ranking of motifs. Furthermore, we support the analysis process by means of a newly proposed meta-data structure that helps the user to select the most promising pattern length. 
This demo aims at illustrating in detail the steps of the proposed approach, showcasing how our algorithm and corresponding graphical insights enable users to efficiently identify the correct motifs.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"201 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76998094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
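The abstract calls the brute-force baseline — rerunning a fixed-length motif search for every length in the range — computationally untenable. The following is an illustrative sketch of that naive baseline only (not VALMOD itself; the function names and the planted-motif example are ours), making the cost concrete: one O(n²) pairwise scan per candidate length.

```python
import numpy as np

def znorm(x):
    """Z-normalize a subsequence; guard against near-constant segments."""
    s = x.std()
    return (x - x.mean()) / s if s > 1e-8 else x - x.mean()

def best_motif_at_length(series, m):
    """Return the closest non-overlapping pair of z-normalized
    subsequences of length m (the motif) and its distance."""
    n = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n)]
    best = (np.inf, -1, -1)
    for i in range(n):
        for j in range(i + m, n):  # skip trivially overlapping matches
            d = np.linalg.norm(subs[i] - subs[j])
            if d < best[0]:
                best = (d, i, j)
    return best

def brute_force_variable_length(series, min_len, max_len):
    """Naive variable-length search: repeat the O(n^2) scan for every
    length in [min_len, max_len] -- this multiplicative blow-up is what
    makes the brute-force approach untenable on long series."""
    return {m: best_motif_at_length(series, m)
            for m in range(min_len, max_len + 1)}

# Plant an identical pattern pair at positions 30 and 130 in random noise.
rng = np.random.default_rng(0)
ts = rng.normal(size=200)
ts[30:50] = ts[130:150] = np.sin(np.linspace(0, 3, 20))
results = brute_force_variable_length(ts, 18, 22)
```

The scan recovers the planted pair at every tested length, but at a cost proportional to the number of lengths times n²; VALMOD's contribution is avoiding exactly this repetition while still ranking motifs across lengths.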
Citations: 19
SketchML: Accelerating Distributed Machine Learning with Data Sketches
Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196894
Jiawei Jiang, Fangcheng Fu, Tong Yang, B. Cui
To address the challenge of explosive big data, distributed machine learning (ML) has drawn the interest of many researchers. Since many distributed ML algorithms trained by stochastic gradient descent (SGD) involve communicating gradients through the network, it is important to compress the transferred gradients. A category of low-precision algorithms can significantly reduce the size of gradients, at the expense of some precision loss. However, existing low-precision methods are not suitable for many cases where the gradients are sparse and nonuniformly distributed. In this paper, we study whether there is a compression method that can efficiently handle a sparse and nonuniform gradient consisting of key-value pairs. Our first contribution is a sketch-based method that compresses the gradient values. A sketch is an algorithm that uses a probabilistic data structure to approximate the distribution of input data. We design a quantile-bucket quantification method that uses a quantile sketch to sort gradient values into buckets and encodes them with the bucket indexes. To further compress the bucket indexes, our second contribution is a sketch algorithm, namely MinMaxSketch. MinMaxSketch builds a set of hash tables and resolves hash collisions with a MinMax strategy. The third contribution of this paper is a delta-binary encoding method that calculates the increments of the gradient keys and stores them with fewer bytes. We also theoretically discuss the correctness and the error bounds of the three proposed methods. To the best of our knowledge, this is the first effort combining data sketches with ML. We implement a prototype system in a real cluster of our industrial partner Tencent Inc., and show that our method is up to 10X faster than existing methods.
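Two of the three steps the abstract describes — quantile-bucket quantification of the values and delta-binary encoding of the keys — can be illustrated compactly. The sketch below is our own simplified rendering, not SketchML's implementation: `np.quantile` over the full batch stands in for a streaming quantile sketch, and the MinMaxSketch stage for compressing bucket indexes is omitted entirely.

```python
import numpy as np

def encode(keys, values, num_buckets=16):
    """Compress a sparse gradient given as sorted integer keys and
    float values: quantize values to quantile buckets, delta-encode keys."""
    # Bucket boundaries at empirical quantiles (a stand-in for a
    # streaming quantile sketch over the gradient values).
    edges = np.quantile(values, np.linspace(0, 1, num_buckets + 1))
    # Map each value to the bucket that contains it.
    idx = np.clip(np.searchsorted(edges, values, side="right") - 1,
                  0, num_buckets - 1).astype(np.uint8)
    # Delta-binary encoding: consecutive sorted keys differ by small
    # increments, which need fewer bytes than the raw key values.
    deltas = np.diff(keys, prepend=0)
    return edges, idx, deltas

def decode(edges, idx, deltas):
    """Recover keys exactly and values approximately (bucket midpoints)."""
    keys = np.cumsum(deltas)
    midpoints = (edges[:-1] + edges[1:]) / 2.0
    return keys, midpoints[idx]
```

Decoded values land at bucket midpoints, so the reconstruction error is bounded by half the local bucket width — narrow where values are dense, which is why quantile (rather than uniform) buckets suit sparse, nonuniformly distributed gradients.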
Citations: 90
Proceedings of the 2018 International Conference on Management of Data