Proceedings of the 2016 International Conference on Management of Data: Latest Papers

Introduction to Spark 2.0 for Database Researchers
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2912565
Michael Armbrust, Doug Bateman, Reynold Xin, M. Zaharia
Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics. Over 1000 volunteers have contributed code to the project; it is supported by virtually every commercial vendor; many universities are now offering courses on Spark. Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation. This tutorial is designed for database researchers (graduate students, faculty members, and industrial researchers) interested in a brief hands-on overview of Spark. This tutorial covers the core APIs for using Spark 2.0, including DataFrames, Datasets, SQL, streaming and machine learning pipelines. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment. In addition, we will dive into the engine internals to discuss architectural design choices and their implications in practice. We will guide the audience to "hack" Spark by extending its query optimizer to speed up distributed join execution.
Citations: 20
GPL: A GPU-based Pipelined Query Processing Engine
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915224
Johns Paul, Jiong He, Bingsheng He
Graphics Processing Units (GPUs) have evolved as a powerful query co-processor for main memory On-Line Analytical Processing (OLAP) databases. However, existing GPU-based query processors adopt a kernel-based execution approach which optimizes individual kernels for resource utilization and executes the GPU kernels involved in the query plan one by one. Such a kernel-based approach cannot utilize all GPU resources efficiently due to the resource underutilization of individual kernels and memory ping-pong across kernel executions. In this paper, we propose GPL, a novel pipelined query execution engine to improve the resource utilization of query co-processing on the GPU. Different from the existing kernel-based execution, GPL takes advantage of hardware features of new-generation GPUs including concurrent kernel execution and efficient data communication channel between kernels. We further develop an analytical model to guide the generation of the optimal pipelined query plan. Thus, the tile size of the pipelined query execution can be adapted in a cost-based manner. We evaluate GPL with TPC-H queries on both AMD and NVIDIA GPUs. The experimental results show that 1) the analytical model is able to guide determining the suitable parameter values in pipelined query execution plan, and 2) GPL is able to significantly outperform the state-of-the-art kernel-based query processing approaches, with improvement up to 48%.
Citations: 65
Provenance: On and Behind the Screens
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2912568
Melanie Herschel, Marcel Hlawatsch
Collecting and processing provenance, i.e., information describing the production process of some end product, is important in various applications, e.g., to assess quality, to ensure reproducibility, or to reinforce trust in the end product. In the past, different types of provenance meta-data have been proposed, each with a different scope. The first part of the proposed tutorial provides an overview and comparison of these different types of provenance. To put provenance to good use, it is essential to be able to interact with and present provenance data in a user-friendly way. Often, users interested in provenance are not necessarily experts in databases or query languages, as they are typically domain experts of the product and production process for which provenance is collected (biologists, journalists, etc.). Furthermore, in some scenarios, it is difficult to use solely queries for analyzing and exploring provenance data. The second part of this tutorial therefore focuses on enabling users to leverage provenance through adapted visualizations. To this end, we will present some fundamental concepts of visualization before we discuss possible visualizations for provenance.
Citations: 19
Speedup Graph Processing by Graph Ordering
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915220
Hao Wei, J. Yu, Can Lu, Xuemin Lin
The CPU cache performance is one of the key issues to efficiency in database systems. It is reported that cache miss latency takes a half of the execution time in database systems. To improve the CPU cache performance, there are studies to support searching including cache-oblivious, and cache-conscious trees. In this paper, we focus on CPU speedup for graph computing in general by reducing the CPU cache miss ratio for different graph algorithms. The approaches dealing with trees are not applicable to graphs which are complex in nature. In this paper, we explore a general approach to speed up CPU computing, in order to further enhance the efficiency of the graph algorithms without changing the graph algorithms (implementations) and the data structures used. That is, we aim at designing a general solution that is not for a specific graph algorithm, neither for a specific data structure. The approach studied in this work is graph ordering, which is to find the optimal permutation among all nodes in a given graph by keeping nodes that will be frequently accessed together locally, to minimize the CPU cache miss ratio. We prove the graph ordering problem is NP-hard, and give a basic algorithm with a bounded approximation. To improve the time complexity of the basic algorithm, we further propose a new algorithm to reduce the time complexity and improve the efficiency with new optimization techniques based on a new data structure. We conducted extensive experiments to evaluate our approach in comparison with other 9 possible graph orderings (such as the one obtained by METIS) using 8 large real graphs and 9 representative graph algorithms. We confirm that our approach can achieve high performance by reducing the CPU cache miss ratios.
Citations: 132
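Illustrative note on the graph ordering entry above: the abstract describes renumbering nodes so that nodes accessed together sit close to each other in memory. The sketch below is not the authors' algorithm. It swaps in a plain breadth-first ordering as a simple stand-in heuristic, and it uses the mean distance between the positions of adjacent nodes as a rough, assumed proxy for locality (the paper itself measures CPU cache miss ratios). The function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_order(adj):
    """Return a node permutation obtained by breadth-first traversal.

    A simple stand-in heuristic, NOT the ordering algorithm from the paper.
    """
    order, seen = [], set()
    for root in adj:                      # restart on every unvisited component
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return order


def mean_edge_gap(adj, order):
    """Average |position(u) - position(v)| over all adjacency entries.

    Smaller values mean neighbours are stored closer together; this is only
    an assumed proxy for cache friendliness.
    """
    pos = {node: i for i, node in enumerate(order)}
    gaps = [abs(pos[u] - pos[v]) for u in adj for v in adj[u]]
    return sum(gaps) / len(gaps)


# Toy graph: the path 0-3-1-4-2, stored under "scrambled" node ids.
adj = {0: [3], 1: [3, 4], 2: [4], 3: [0, 1], 4: [1, 2]}

print(mean_edge_gap(adj, sorted(adj)))      # original id order
print(mean_edge_gap(adj, bfs_order(adj)))   # after reordering
```

On this toy path the reordered layout places every pair of neighbours in adjacent positions, which is the effect the abstract aims for at much larger scale and with a far more careful objective.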
How to Architect a Query Compiler
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915244
A. Shaikhha, Yannis Klonatos, L. Parreaux, Lewis Brown, Mohammad Dashti, Christoph E. Koch
This paper studies architecting query compilers. The state of the art in query compiler construction is lagging behind that in the compilers field. We attempt to remedy this by exploring the key causes of technical challenges in need of well founded solutions, and by gathering the most relevant ideas and approaches from the PL and compilers communities for easy digestion by database researchers. All query compilers known to us are more or less monolithic template expanders that do the bulk of the compilation task in one large leap. Such systems are hard to build and maintain. We propose to use a stack of multiple DSLs on different levels of abstraction with lowering in multiple steps to make query compilers easier to build and extend, ultimately allowing us to create more convincing and sustainable compiler-based data management systems. We attempt to derive our advice for creating such DSL stacks from widely acceptable principles. We have also re-created a well-known query compiler following these ideas and report on this effort.
Citations: 81
PerfEnforce Demonstration: Data Analytics with Performance Guarantees
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899402
Jennifer Ortiz, Brendan Lee, M. Balazinska
We demonstrate PerfEnforce, a dynamic scaling engine for analytics services. PerfEnforce automatically scales a cluster of virtual machines in order to minimize costs while probabilistically meeting the query runtime guarantees offered by a performance-oriented service level agreement (SLA). The demonstration will show three families of dynamic scaling algorithms --feedback control, reinforcement learning, and online machine learning--and will enable attendees to change tuning parameters, performance thresholds, and workloads to compare and contrast the algorithms in different settings.
Citations: 28
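Illustrative note on the PerfEnforce entry above: feedback control is one of the three families of scaling algorithms the abstract names. The sketch below shows a generic proportional controller that resizes a cluster from the gap between an observed query runtime and its SLA target. It is an assumption-laden illustration, not PerfEnforce's controller: the function name, the gain, and the size limits are all hypothetical.

```python
def next_cluster_size(current, observed_runtime, sla_runtime,
                      gain=0.5, min_size=1, max_size=32):
    """Proportional feedback step (hypothetical parameters throughout).

    error > 0 means the query ran slower than the SLA allows, so the cluster
    grows; error < 0 means there is slack, so it shrinks to save cost.
    """
    error = (observed_runtime - sla_runtime) / sla_runtime
    proposed = current * (1 + gain * error)
    # Clamp to the allowed range and round to a whole number of machines.
    return max(min_size, min(max_size, round(proposed)))


print(next_cluster_size(4, observed_runtime=20.0, sla_runtime=10.0))  # grows
print(next_cluster_size(4, observed_runtime=5.0, sla_runtime=10.0))   # shrinks
```

The gain trades reaction speed against oscillation: a larger value chases the SLA faster but can overshoot, which is the kind of tuning parameter the demonstration lets attendees adjust.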
Wildfire: Concurrent Blazing Data Ingest and Analytics
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899406
Ronald Barber, Matthew Huras, G. Lohman, C. Mohan, René Müller, Fatma Özcan, H. Pirahesh, Vijayshankar Raman, Richard Sidle, O. Sidorkin, Adam J. Storm, Yuanyuan Tian, Pınar Tözün
We demonstrate Hybrid Transactional and Analytics Processing (HTAP) on the Spark platform by the Wildfire prototype, which can ingest up to ~6 million inserts per second per node and simultaneously perform complex SQL analytics queries. Here, a simplified mobile application uses Wildfire to recommend advertising to mobile customers based upon their distance from stores and their interest in products sold by these stores, while continuously graphing analytics results as those customers move and respond to the ads with purchases.
Citations: 16
Searching Web Data using MinHash LSH
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2914838
B. Rao, Erkang Zhu
In this extended abstract, we explore the use of MinHash Locality Sensitive Hashing (MinHash LSH) to address the problem of indexing and searching Web data. We discuss a statistical tuning strategy of MinHash LSH, and experimentally evaluate the accuracy and performance, compared with inverted index. In addition, we describe an on-line demo for the index with real Web data.
Citations: 7
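Illustrative note on the MinHash LSH entry above: the sketch below shows the textbook form of the technique, using only the Python standard library. It is not the authors' implementation, and the abstract does not describe their tuning strategy in detail. The class and function names, the salted BLAKE2b hashing scheme, and the default of 64 hash functions in 16 bands are assumptions made for illustration. `candidate_probability` is the standard banding formula, 1 - (1 - s^r)^b, which is the usual starting point when choosing the number of bands and rows.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_perm=64):
    """Return num_perm minimum hash values, one per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for item in items))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_probability(similarity, bands, rows):
    """Chance that two sets with the given Jaccard similarity share a band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


class MinHashLSH:
    """Banded index: sets whose signatures agree on any whole band are candidates."""

    def __init__(self, num_perm=64, bands=16):
        assert num_perm % bands == 0, "bands must divide num_perm"
        self.bands = bands
        self.rows = num_perm // bands
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        for b in range(self.bands):
            yield b, tuple(signature[b * self.rows:(b + 1) * self.rows])

    def insert(self, key, signature):
        for b, band_key in self._band_keys(signature):
            self.tables[b][band_key].add(key)

    def query(self, signature):
        found = set()
        for b, band_key in self._band_keys(signature):
            found |= self.tables[b].get(band_key, set())
        return found


index = MinHashLSH(num_perm=64, bands=16)
page = {"open", "data", "portal", "search"}
index.insert("page-1", minhash_signature(page))
# An identical set produces an identical signature, so it matches in every band.
print(index.query(minhash_signature(page)))
```

More bands with fewer rows each raise recall at the cost of more false candidates; fewer bands with more rows do the opposite.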
Web-based Benchmarks for Forecasting Systems: The ECAST Platform
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899399
R. Ulbricht, Claudio Hartmann, M. Hahmann, H. Donker, Wolfgang Lehner
The role of precise forecasts in the energy domain has changed dramatically. New supply forecasting methods are developed to better address this challenge, but meaningful benchmarks are rare and time-intensive. We propose the ECAST online platform in order to solve that problem. The system's capability is demonstrated on a real-world use case by comparing the performance of different prediction tools.
Citations: 5
A Hybrid B+-tree as Solution for In-Memory Indexing on CPU-GPU Heterogeneous Computing Platforms
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2882918
Amirhesam Shahvarani, H. Jacobsen
An in-memory indexing tree is a critical component of many databases. Modern many-core processors, such as GPUs, are offering tremendous amounts of computing power making them an attractive choice for accelerating indexing. However, the memory available to the accelerating co-processor is rather limited and expensive in comparison to the memory available to the CPU. This drawback is a barrier to exploit the computing power of co-processors for arbitrarily large index trees. In this paper, we propose a novel design for a B+-tree based on the heterogeneous computing platform and the hybrid memory architecture found in GPUs. We propose a hybrid CPU-GPU B+-tree, "HB+-tree," which targets high search throughput use cases. Unique to our design is the joint and simultaneous use of computing and memory resources of CPU-GPU systems. Our experiments show that our HB+-tree can perform up to 240 million index queries per second, which is 2.4X higher than our CPU-optimized solution.
Citations: 38