
Proceedings. ACM-SIGMOD International Conference on Management of Data: Latest Publications

Fine-grained disclosure control for app ecosystems
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2467798
G. Bender, Lucja Kot, J. Gehrke, Christoph E. Koch
The modern computing landscape contains an increasing number of app ecosystems, where users store personal data on platforms such as Facebook or smartphones. APIs enable third-party applications (apps) to utilize that data. A key concern associated with app ecosystems is the confidentiality of user data. In this paper, we develop a new model of disclosure in app ecosystems. In contrast with previous solutions, our model is data-derived and semantically meaningful. Information disclosure is modeled in terms of a set of distinguished security views. Each query is labeled with the precise set of security views that is needed to answer it, and these labels drive policy decisions. We explain how our disclosure model can be used in practice and provide algorithms for labeling conjunctive queries for the case of single-atom security views. We show that our approach is useful by demonstrating the scalability of our algorithms and by applying it to the real-world disclosure control system used by Facebook.
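Where this labeling idea lands in code can be sketched minimally. The sketch below is an assumption-laden illustration, not the paper's algorithm: each single-atom security view exposes one relation, a query's label is the set of views covering the relations it touches, and a policy check compares that label against the views granted to an app (all view and relation names are hypothetical).

```python
# Hypothetical single-atom security views: each view exposes one relation.
SECURITY_VIEWS = {
    "v_friends": "friends",
    "v_photos": "photos",
    "v_messages": "messages",
}

def label_query(query_atoms):
    """Label a query with the set of security views whose atoms it touches."""
    return {view for view, rel in SECURITY_VIEWS.items() if rel in query_atoms}

def policy_allows(query_atoms, granted_views):
    """A policy decision driven by the label: allow the query only if every
    view in its label has been granted to the requesting app."""
    return label_query(query_atoms) <= set(granted_views)
```

In this toy model a query over `friends` and `photos` is labeled `{v_friends, v_photos}`, and the policy admits it only when the app holds both views.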
Citations: 14
GeoDeepDive: statistical inference using familiar data-processing languages
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463680
Ce Zhang, Vidhya Govindaraju, J. Borchardt, Timothy L. Foltz, C. Ré, S. Peters
We describe our proposed demonstration of GeoDeepDive, a system that helps geoscientists discover information and knowledge buried in the text, tables, and figures of geology journal articles. This requires solving a host of classical data management challenges including data acquisition (e.g., from scanned documents), data extraction, and data integration. SIGMOD attendees will see demonstrations of three aspects of our system: (1) an end-to-end system that is of a high enough quality to perform novel geological science, but is written by a small enough team so that each aspect can be manageably explained; (2) a simple feature engineering system that allows a user to write in familiar SQL or Python; and (3) the effect of different sources of feedback on result quality including expert labeling, distant supervision, traditional rules, and crowd-sourced data. Our prototype builds on our work integrating statistical inference and learning tools into traditional database systems. If successful, our demonstration will allow attendees to see that data processing systems that use machine learning contain many familiar data processing problems such as efficient querying, indexing, and supporting tools for database-backed websites, none of which are machine-learning problems, per se.
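As a hedged illustration of point (2), writing a feature in plain Python might look like the sketch below; the function name, feature strings, and calling convention are invented here and are not GeoDeepDive's actual API.

```python
def features(mention, sentence_tokens):
    """Emit string features for a candidate rock-formation mention.
    Purely illustrative: real feature functions and their inputs differ."""
    feats = []
    # Surface-form feature on the mention itself.
    if mention.lower().endswith("formation"):
        feats.append("ENDS_WITH_FORMATION")
    # Context features: age-related terms appearing in the same sentence.
    for tok in sentence_tokens:
        if tok.lower() in {"ma", "million", "years"}:
            feats.append("NEAR_AGE_TERM:" + tok.lower())
    return feats
```

The appeal of such an interface is that a geoscientist writes ordinary Python (or SQL) while the system handles extraction, storage, and statistical inference behind the scenes.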
Citations: 53
Split query processing in Polybase
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463709
D. DeWitt, A. Halverson, Rimma V. Nehme, S. Shankar, J. Aguilar-Saborit, Artin Avanes, Miro Flasza, J. Gramling
This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.
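The push-or-not decision described above can be caricatured with a toy cost model. The sketch below is not Polybase's optimizer; the cost formula and constants are made-up assumptions that merely illustrate how predicate selectivity and relative cluster sizes can flip the decision.

```python
def should_push_to_hadoop(rows, selectivity, hadoop_nodes, pdw_nodes,
                          mr_startup_cost=1000.0):
    """Toy cost comparison (invented constants, not Polybase's model):
    push a selection to Hadoop as a MapReduce job, or pull all rows
    into PDW and filter there?"""
    # If pushed: pay MapReduce startup, scan spread over Hadoop nodes,
    # then transfer only the rows that survive the predicate.
    push_cost = mr_startup_cost + rows / hadoop_nodes + rows * selectivity
    # If not pushed: transfer every row, then filter inside PDW.
    pull_cost = rows + rows / pdw_nodes
    return push_cost < pull_cost
```

Under this caricature, a highly selective predicate over a large HDFS-resident table favors pushing (little data crosses the wire), while a small table does not, since the fixed MapReduce startup cost dominates.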
Citations: 150
GRDB: a system for declarative and interactive analysis of noisy information networks
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465257
W. E. Moustafa, Hui Miao, A. Deshpande, L. Getoor
There is a growing interest in methods for analyzing data describing networks of all types, including biological, physical, social, and scientific collaboration networks. Typically the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This demonstration presents GrDB, a system that enables data analysts to write declarative programs to specify and combine different network data cleaning tasks, visualize the output, and engage in the process of decision review and correction if necessary. The declarative interface of GrDB makes it very easy to quickly write analysis tasks and execute them over data, while the visual component facilitates debugging the program and performing fine grained corrections.
Citations: 5
Noah: a dynamic ridesharing system
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463695
Charles Tian, Y. Huang, Zhi Liu, F. Bastani, R. Jin
This demo presents Noah: a dynamic ridesharing system. Noah supports large scale real-time ridesharing with service guarantee on road networks. Taxis and trip requests are dynamically matched. Different from traditional systems, a taxi can have more than one customer on board given that all waiting time and service time constraints of trips are satisfied. Noah's real-time response relies on three main components: (1) a fast shortest path algorithm with caching on road networks; (2) fast dynamic matching algorithms to schedule ridesharing on the fly; (3) a spatial indexing method for fast retrieving moving taxis. Users will be able to submit requests from a smartphone, choose specific parameters such as number of taxis in the system, service constraints, and matching algorithms, to explore the internal functionalities and implementations of Noah. The system analyzer will show the system performance including average waiting time, average detour percentage, average response time, and average level of sharing. Taxis, routes, and requests will be animated and visualized through Google Maps API. The demo is based on trips of 17,000 Shanghai taxis for one day (May 29, 2009); the dataset contains 432,327 trips. Each trip includes the starting and destination coordinates and the start time. An iPhone application is implemented to allow users to submit a trip request to the Noah system during the demonstration.
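To make the constraint-matching idea concrete, here is a deliberately crude feasibility check, purely illustrative and not Noah's algorithm: it uses straight-line distances instead of road-network shortest paths, and a single detour ratio in place of Noah's separate waiting-time and service-time constraints.

```python
import math

def dist(a, b):
    """Straight-line distance between two (x, y) points; Noah itself
    uses shortest paths on the road network."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def can_share(current_trips, new_pickup, new_dropoff, max_detour=1.5):
    """current_trips: list of (pickup, dropoff) pairs already on board.
    Accept the new rider only if serving them first would stretch no
    existing trip beyond max_detour times its direct distance."""
    extra = dist(new_pickup, new_dropoff)
    for pickup, dropoff in current_trips:
        direct = dist(pickup, dropoff)
        if direct + extra > max_detour * direct:
            return False
    return True
```

A real matcher must additionally consider insertion order of pickups and dropoffs, capacity, and per-trip time windows, which is what makes the scheduling problem hard.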
Citations: 45
Parallel analytics as a service
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463714
Petrie Wong, Zhian He, Eric Lo
Recently, massively parallel processing relational database systems (MPPDBs) have gained much momentum in the big data analytics market. With the advent of hosted cloud computing, we envision that the offering of MPPDB-as-a-Service (MPPDBaaS) will become attractive for companies whose analytical tasks span only hundreds of gigabytes to some tens of terabytes of data, because they can enjoy high-end parallel analytics at a low cost. This paper presents Thrifty, a prototype implementation of MPPDB-as-a-service. The major research issue is how to achieve a lower total cost of ownership by consolidating thousands of MPPDB tenants onto a shared hardware infrastructure, with a performance SLA that guarantees the tenants can obtain query results as if they were executing their queries on dedicated machines. Thrifty achieves this goal with a tenant-driven design that includes (1) a cluster design that carefully arranges the nodes in the cluster into groups and creates an MPPDB for each group of nodes, (2) a tenant placement scheme that assigns each tenant to several MPPDBs (for high-availability service through replication), and (3) a query routing algorithm that routes a tenant's query to the proper MPPDB at run-time. Experiments show that in an MPPDBaaS with 5000 tenants, where each tenant requests a 2- to 32-node MPPDB to query against 200GB to 3.2TB of data, Thrifty can serve all the tenants with a 99.9% performance SLA guarantee and a high-availability replication factor of 3, using only 18.7% of the nodes requested by the tenants.
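A simplified sketch of components (2) and (3) follows. It is an assumption, not Thrifty's actual placement or routing logic: tenants are assigned to a fixed number of replica MPPDBs round-robin, and each query goes to the least-loaded replica.

```python
import itertools

def place_tenants(tenants, mppdbs, replication=3):
    """Assign each tenant to `replication` distinct MPPDBs, round-robin.
    Toy placement only; assumes len(mppdbs) >= replication."""
    ring = itertools.cycle(range(len(mppdbs)))
    placement = {}
    for t in tenants:
        placement[t] = sorted({next(ring) for _ in range(replication)})
    return placement

def route_query(tenant, placement, load):
    """Route the tenant's query to its least-loaded replica
    (load[i] is the current load of MPPDB i)."""
    return min(placement[tenant], key=lambda db: load[db])
```

The real system must also size each MPPDB for its co-located tenants so that the 99.9% SLA holds, which is where the consolidation savings come from.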
Citations: 22
BigBench: towards an industry standard benchmark for big data analytics
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463712
A. Ghazal, T. Rabl, Minqing Hu, Francois Raab, Meikel Poess, A. Crolotte, H. Jacobsen
There is a tremendous interest in big data from academia, industry and a large user base. Several commercial and open-source providers have released a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare their performance. In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and a synthetic data generator that address the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark and enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business perspective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical perspective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques. We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200-gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.
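The scale-factor-driven generation of structured and unstructured data can be illustrated with a toy generator. Table names, row proportions, and review texts below are invented for illustration and bear no relation to the real BigBench generator.

```python
import random

def generate(scale_factor, seed=42):
    """Toy synthetic generator: one scale factor drives the volume of
    structured rows (an item table) and unstructured text (reviews)."""
    rng = random.Random(seed)  # seeded for reproducible data sets
    n_items = 100 * scale_factor      # structured part
    n_reviews = 300 * scale_factor    # unstructured part
    items = [{"item_id": i, "price": round(rng.uniform(1, 500), 2)}
             for i in range(n_items)]
    reviews = [{"item_id": rng.randrange(n_items),
                "text": rng.choice(["great product",
                                    "broke after a week",
                                    "works as described"])}
               for _ in range(n_reviews)]
    return items, reviews
```

Doubling the scale factor doubles every table, mirroring how benchmark generators let a single knob control total data volume.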
Citations: 373
Trinity: a distributed graph engine on a memory cloud
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2467799
Bin Shao, Haixun Wang, Yatao Li
Computations performed by graph algorithms are data-driven and require a high degree of random data access. Despite the great progress made in disk technology, disks still cannot provide the level of efficient random access that graph computation requires. On the other hand, memory-based approaches usually do not scale because of the capacity limit of single machines. In this paper, we introduce Trinity, a general-purpose graph engine over a distributed memory cloud. Through optimized memory management and network communication, Trinity supports fast graph exploration as well as efficient parallel computing. In particular, Trinity leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance. This enables Trinity to support efficient online query processing and offline analytics on large graphs with just a few commodity machines. Furthermore, Trinity provides a high-level specification language called TSL that lets users declare data schemas and communication protocols, which brings great ease of use for general-purpose graph management and computing. Our experiments demonstrate Trinity's performance on both low-latency graph queries and high-throughput graph analytics on web-scale, billion-node graphs.
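A toy picture of the memory-cloud idea, hashing node IDs to machines and exploring adjacency lists across partitions, is sketched below; this is an illustrative assumption, not Trinity's implementation.

```python
class MemoryCloud:
    """Toy 'memory cloud': each machine holds an in-memory partition of the
    adjacency lists, and a node's partition is chosen by hashing its ID."""

    def __init__(self, machines=4):
        self.parts = [dict() for _ in range(machines)]

    def _part(self, node):
        return self.parts[hash(node) % len(self.parts)]

    def add_edge(self, u, v):
        self._part(u).setdefault(u, []).append(v)

    def neighbors(self, u):
        return self._part(u).get(u, [])

def explore(cloud, start, hops):
    """Breadth-first exploration up to `hops` hops from `start`."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {v for u in frontier for v in cloud.neighbors(u)} - seen
        seen |= frontier
    return seen
```

In the real system each hop may cross machines, so the engine's performance hinges on exactly the memory-management and network optimizations the abstract describes.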
Citations: 461
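The core idea of the Trinity abstract, random access into a distributed in-memory store instead of disk, can be illustrated with a minimal sketch. Everything here is hypothetical: `MemoryCloud` and `bfs_explore` are illustrative names, not Trinity's actual API, and real Trinity cells hold binary data declared in TSL and are partitioned across machines rather than across Python dicts.

```python
# Sketch: graph exploration over a hash-partitioned in-memory store,
# in the spirit of a "memory cloud". Illustrative only, not Trinity's API.
from collections import deque

class MemoryCloud:
    """Simulates a distributed key-value store: vertex cells are
    hash-partitioned across a fixed number of in-memory partitions."""
    def __init__(self, num_machines=4):
        self.partitions = [dict() for _ in range(num_machines)]

    def _partition(self, vertex_id):
        # A real system would route this lookup over the network.
        return self.partitions[hash(vertex_id) % len(self.partitions)]

    def put(self, vertex_id, neighbors):
        self._partition(vertex_id)[vertex_id] = list(neighbors)

    def get(self, vertex_id):
        return self._partition(vertex_id).get(vertex_id, [])

def bfs_explore(cloud, start):
    """Graph exploration: each hop is a random access into the
    memory cloud rather than a disk seek."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in cloud.get(v):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

cloud = MemoryCloud()
cloud.put("a", ["b", "c"])
cloud.put("b", ["d"])
cloud.put("c", [])
cloud.put("d", [])
print(bfs_explore(cloud, "a"))  # ['a', 'b', 'c', 'd']
```

The point of the sketch is the access pattern: BFS issues many small, unpredictable lookups, which is cheap against RAM-resident partitions but ruinous against disk.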
String similarity measures and joins with synonyms
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465313
Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang
A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings considers only syntactic similarities, e.g., the number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of this paper is to exploit such existing knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because using synonyms in similarity measures, while expressive, is computationally expensive (NP-hard), we propose an efficient algorithm, called selective expansion, which guarantees optimality in many real scenarios. We then study a novel indexing structure called SI-tree, which combines signature and length filtering strategies for efficient string similarity joins with synonyms. We develop an estimator to approximate the number of candidates, enabling an online selection of signature filters that further improves efficiency. The estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time, making our method attractive both in theory and in practice. Finally, the results of an empirical study of the algorithms verify the effectiveness and efficiency of our approach.
Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang. "String similarity measures and joins with synonyms." Proceedings. ACM-SIGMOD International Conference on Management of Data, pp. 373-384, 2013-06-22. DOI: 10.1145/2463676.2465313.
Citations: 67
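The expansion-based idea can be sketched naively: enumerate every synonym rewriting of both strings and take the best Jaccard score over the resulting token sets. The paper's selective-expansion algorithm exists precisely to prune this exponential enumeration; the brute-force version below is only feasible for tiny inputs, and all names (`expansions`, `synonym_similarity`) are illustrative, not the paper's.

```python
# Naive full-expansion sketch of synonym-aware string similarity.
# Brute force over all rule applications; fine only for tiny inputs.
from itertools import product

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def expansions(tokens, synonyms):
    """Yield every token list reachable by independently rewriting each
    token with one of its synonyms (or leaving it unchanged)."""
    options = [[t] + synonyms.get(t, []) for t in tokens]
    for choice in product(*options):
        yield list(choice)

def synonym_similarity(s1, s2, synonyms):
    """Maximum Jaccard similarity over all expansions of both strings."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    return max(jaccard(e1, e2)
               for e1 in expansions(t1, synonyms)
               for e2 in expansions(t2, synonyms))

synonyms = {"bill": ["william"], "sam": ["samuel"]}
print(synonym_similarity("Bill Gates", "William Gates", synonyms))  # 1.0
```

A plain token-Jaccard measure would score "Bill Gates" vs. "William Gates" at 1/3; applying the synonym rule lifts it to 1.0, which is exactly the gap the paper targets.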
Online search of overlapping communities
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463722
Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, Wei Wang
A great deal of research has been conducted on modeling and discovering communities in complex networks. In most real-life networks, an object often participates in multiple overlapping communities. In view of this, recent research has focused on mining overlapping communities in complex networks. These algorithms essentially materialize a snapshot of the overlapping communities in the network. This approach has three drawbacks, however. First, the mining algorithm uses the same global criterion to decide whether a subgraph qualifies as a community; in other words, the criterion is fixed and predetermined, whereas in reality communities for different vertices may have very different characteristics. Second, it is costly, time consuming, and often unnecessary to find communities for an entire network. Third, the approach does not support dynamically evolving networks. In this paper, we focus on online search of overlapping communities: given a query vertex, we find, in an online manner, meaningful overlapping communities that the vertex belongs to. In doing so, each search can use a community criterion tailored to the vertex in question. To support this approach, we introduce a novel model of overlapping communities and provide theoretical guidelines for tuning the model. We present several algorithms for online overlapping community search and conduct comprehensive experiments that demonstrate the effectiveness of the model and the algorithms. We also suggest many potential applications of our model and algorithms.
Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, Wei Wang. "Online search of overlapping communities." Proceedings. ACM-SIGMOD International Conference on Management of Data, pp. 277-288, 2013-06-22. DOI: 10.1145/2463676.2463722.
Citations: 179
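The query-centric flavor of online community search can be sketched with a simple greedy heuristic: grow a community outward from the query vertex, repeatedly adding the boundary vertex that most improves internal edge density, and stop when no addition helps. This density criterion is a stand-in for illustration, not the paper's community model, and the function names are hypothetical.

```python
# Sketch of online, query-centric community search via greedy growth.
# The density objective is illustrative, not the paper's model.

def density(graph, comm):
    """Fraction of edges incident to `comm` that stay inside it.
    Internal edges are counted from both endpoints, hence the /2."""
    internal = sum(1 for v in comm for w in graph[v] if w in comm) / 2
    external = sum(1 for v in comm for w in graph[v] if w not in comm)
    return internal / (internal + external) if internal + external else 0.0

def local_community(graph, query):
    """Grow a community around `query`, one boundary vertex at a time,
    stopping when no candidate strictly improves the density."""
    comm = {query}
    while True:
        boundary = {w for v in comm for w in graph[v]} - comm
        best, best_d = None, density(graph, comm)
        for w in boundary:
            d = density(graph, comm | {w})
            if d > best_d:
                best, best_d = w, d
        if best is None:
            return comm
        comm.add(best)

# Two triangles joined by a bridge a-x: the search stays on the
# query's side of the bridge.
graph = {
    "a": ["b", "c", "x"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "x": ["a", "y", "z"],
    "y": ["x", "z"],
    "z": ["x", "y"],
}
print(sorted(local_community(graph, "a")))  # ['a', 'b', 'c']
```

Note how the result depends on the query vertex: starting from "y" yields the other triangle. This is the advantage the abstract argues for, community criteria evaluated per query rather than one global criterion materialized for the whole network.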