
Proceedings. ACM-SIGMOD International Conference on Management of Data: Latest Publications

Fine-grained disclosure control for app ecosystems
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2467798
G. Bender, Lucja Kot, J. Gehrke, Christoph E. Koch
The modern computing landscape contains an increasing number of app ecosystems, where users store personal data on platforms such as Facebook or smartphones. APIs enable third-party applications (apps) to utilize that data. A key concern associated with app ecosystems is the confidentiality of user data. In this paper, we develop a new model of disclosure in app ecosystems. In contrast with previous solutions, our model is data-derived and semantically meaningful. Information disclosure is modeled in terms of a set of distinguished security views. Each query is labeled with the precise set of security views that is needed to answer it, and these labels drive policy decisions. We explain how our disclosure model can be used in practice and provide algorithms for labeling conjunctive queries for the case of single-atom security views. We show that our approach is useful by demonstrating the scalability of our algorithms and by applying it to the real-world disclosure control system used by Facebook.
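Where this labeling idea lands in code can be sketched minimally. The sketch below is an assumption-laden illustration, not the paper's algorithm: each single-atom security view exposes one relation, a query's label is the set of views covering the relations it touches, and a policy check compares that label against the views granted to an app (all view and relation names are hypothetical).

```python
# Hypothetical single-atom security views: each view exposes one relation.
SECURITY_VIEWS = {
    "v_friends": "friends",
    "v_photos": "photos",
    "v_messages": "messages",
}

def label_query(query_atoms):
    """Label a query with the set of security views whose atoms it touches."""
    return {view for view, rel in SECURITY_VIEWS.items() if rel in query_atoms}

def policy_allows(query_atoms, granted_views):
    """A policy decision driven by the label: allow the query only if every
    view in its label has been granted to the requesting app."""
    return label_query(query_atoms) <= set(granted_views)
```

In this toy model a query over `friends` and `photos` is labeled `{v_friends, v_photos}`, and the policy admits it only when the app holds both views.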
Citations: 14
GeoDeepDive: statistical inference using familiar data-processing languages
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463680
Ce Zhang, Vidhya Govindaraju, J. Borchardt, Timothy L. Foltz, C. Ré, S. Peters
We describe our proposed demonstration of GeoDeepDive, a system that helps geoscientists discover information and knowledge buried in the text, tables, and figures of geology journal articles. This requires solving a host of classical data management challenges including data acquisition (e.g., from scanned documents), data extraction, and data integration. SIGMOD attendees will see demonstrations of three aspects of our system: (1) an end-to-end system that is of a high enough quality to perform novel geological science, but is written by a small enough team so that each aspect can be manageably explained; (2) a simple feature engineering system that allows a user to write in familiar SQL or Python; and (3) the effect of different sources of feedback on result quality including expert labeling, distant supervision, traditional rules, and crowd-sourced data. Our prototype builds on our work integrating statistical inference and learning tools into traditional database systems. If successful, our demonstration will allow attendees to see that data processing systems that use machine learning contain many familiar data processing problems such as efficient querying, indexing, and supporting tools for database-backed websites, none of which are machine-learning problems, per se.
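As a hedged illustration of point (2), writing a feature in plain Python might look like the sketch below; the function name, feature strings, and calling convention are invented here and are not GeoDeepDive's actual API.

```python
def features(mention, sentence_tokens):
    """Emit string features for a candidate rock-formation mention.
    Purely illustrative: real feature functions and their inputs differ."""
    feats = []
    # Surface-form feature on the mention itself.
    if mention.lower().endswith("formation"):
        feats.append("ENDS_WITH_FORMATION")
    # Context features: age-related terms appearing in the same sentence.
    for tok in sentence_tokens:
        if tok.lower() in {"ma", "million", "years"}:
            feats.append("NEAR_AGE_TERM:" + tok.lower())
    return feats
```

The appeal of such an interface is that a geoscientist writes ordinary Python (or SQL) while the system handles extraction, storage, and statistical inference behind the scenes.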
Citations: 53
Split query processing in Polybase
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463709
D. DeWitt, A. Halverson, Rimma V. Nehme, S. Shankar, J. Aguilar-Saborit, Artin Avanes, Miro Flasza, J. Gramling
This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.
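The push-or-not decision described above can be caricatured with a toy cost model. The sketch below is not Polybase's optimizer; the cost formula and constants are made-up assumptions that merely illustrate how predicate selectivity and relative cluster sizes can flip the decision.

```python
def should_push_to_hadoop(rows, selectivity, hadoop_nodes, pdw_nodes,
                          mr_startup_cost=1000.0):
    """Toy cost comparison (invented constants, not Polybase's model):
    push a selection to Hadoop as a MapReduce job, or pull all rows
    into PDW and filter there?"""
    # If pushed: pay MapReduce startup, scan spread over Hadoop nodes,
    # then transfer only the rows that survive the predicate.
    push_cost = mr_startup_cost + rows / hadoop_nodes + rows * selectivity
    # If not pushed: transfer every row, then filter inside PDW.
    pull_cost = rows + rows / pdw_nodes
    return push_cost < pull_cost
```

Under this caricature, a highly selective predicate over a large HDFS-resident table favors pushing (little data crosses the wire), while a small table does not, since the fixed MapReduce startup cost dominates.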
Citations: 150
GRDB: a system for declarative and interactive analysis of noisy information networks
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465257
W. E. Moustafa, Hui Miao, A. Deshpande, L. Getoor
There is a growing interest in methods for analyzing data describing networks of all types, including biological, physical, social, and scientific collaboration networks. Typically the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This demonstration presents GrDB, a system that enables data analysts to write declarative programs to specify and combine different network data cleaning tasks, visualize the output, and engage in the process of decision review and correction if necessary. The declarative interface of GrDB makes it very easy to quickly write analysis tasks and execute them over data, while the visual component facilitates debugging the program and performing fine grained corrections.
Citations: 5
Noah: a dynamic ridesharing system
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463695
Charles Tian, Y. Huang, Zhi Liu, F. Bastani, R. Jin
This demo presents Noah: a dynamic ridesharing system. Noah supports large scale real-time ridesharing with service guarantee on road networks. Taxis and trip requests are dynamically matched. Different from traditional systems, a taxi can have more than one customer on board given that all waiting time and service time constraints of trips are satisfied. Noah's real-time response relies on three main components: (1) a fast shortest path algorithm with caching on road networks; (2) fast dynamic matching algorithms to schedule ridesharing on the fly; (3) a spatial indexing method for fast retrieving moving taxis. Users will be able to submit requests from a smartphone, choose specific parameters such as number of taxis in the system, service constraints, and matching algorithms, to explore the internal functionalities and implementations of Noah. The system analyzer will show the system performance including average waiting time, average detour percentage, average response time, and average level of sharing. Taxis, routes, and requests will be animated and visualized through Google Maps API. The demo is based on trips of 17,000 Shanghai taxis for one day (May 29, 2009); the dataset contains 432,327 trips. Each trip includes the starting and destination coordinates and the start time. An iPhone application is implemented to allow users to submit a trip request to the Noah system during the demonstration.
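To make the constraint-matching idea concrete, here is a deliberately crude feasibility check, purely illustrative and not Noah's algorithm: it uses straight-line distances instead of road-network shortest paths, and a single detour ratio in place of Noah's separate waiting-time and service-time constraints.

```python
import math

def dist(a, b):
    """Straight-line distance between two (x, y) points; Noah itself
    uses shortest paths on the road network."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def can_share(current_trips, new_pickup, new_dropoff, max_detour=1.5):
    """current_trips: list of (pickup, dropoff) pairs already on board.
    Accept the new rider only if serving them first would stretch no
    existing trip beyond max_detour times its direct distance."""
    extra = dist(new_pickup, new_dropoff)
    for pickup, dropoff in current_trips:
        direct = dist(pickup, dropoff)
        if direct + extra > max_detour * direct:
            return False
    return True
```

A real matcher must additionally consider insertion order of pickups and dropoffs, capacity, and per-trip time windows, which is what makes the scheduling problem hard.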
Citations: 45
Parallel analytics as a service
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463714
Petrie Wong, Zhian He, Eric Lo
Recently, massively parallel processing relational database systems (MPPDBs) have gained much momentum in the big data analytics market. With the advent of hosted cloud computing, we envision that the offering of MPPDB-as-a-Service (MPPDBaaS) will become attractive for companies whose analytical tasks span only hundreds of gigabytes to some tens of terabytes of data, because they can enjoy high-end parallel analytics at a low cost. This paper presents Thrifty, a prototype implementation of MPPDB-as-a-service. The major research issue is how to achieve a lower total cost of ownership by consolidating thousands of MPPDB tenants onto a shared hardware infrastructure, with a performance SLA that guarantees the tenants can obtain query results as if they were executing their queries on dedicated machines. Thrifty achieves this goal with a tenant-driven design that includes (1) a cluster design that carefully arranges the nodes in the cluster into groups and creates an MPPDB for each group of nodes, (2) a tenant placement scheme that assigns each tenant to several MPPDBs (for high-availability service through replication), and (3) a query routing algorithm that routes a tenant's query to the proper MPPDB at run-time. Experiments show that in an MPPDBaaS with 5000 tenants, where each tenant requests a 2- to 32-node MPPDB to query against 200GB to 3.2TB of data, Thrifty can serve all the tenants with a 99.9% performance SLA guarantee and a high-availability replication factor of 3, using only 18.7% of the nodes requested by the tenants.
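A simplified sketch of components (2) and (3) follows. It is an assumption, not Thrifty's actual placement or routing logic: tenants are assigned to a fixed number of replica MPPDBs round-robin, and each query goes to the least-loaded replica.

```python
import itertools

def place_tenants(tenants, mppdbs, replication=3):
    """Assign each tenant to `replication` distinct MPPDBs, round-robin.
    Toy placement only; assumes len(mppdbs) >= replication."""
    ring = itertools.cycle(range(len(mppdbs)))
    placement = {}
    for t in tenants:
        placement[t] = sorted({next(ring) for _ in range(replication)})
    return placement

def route_query(tenant, placement, load):
    """Route the tenant's query to its least-loaded replica
    (load[i] is the current load of MPPDB i)."""
    return min(placement[tenant], key=lambda db: load[db])
```

The real system must also size each MPPDB for its co-located tenants so that the 99.9% SLA holds, which is where the consolidation savings come from.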
Citations: 22
BigBench: towards an industry standard benchmark for big data analytics
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463712
A. Ghazal, T. Rabl, Minqing Hu, Francois Raab, Meikel Poess, A. Crolotte, H. Jacobsen
There is a tremendous interest in big data from academia, industry and a large user base. Several commercial and open-source providers have released a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare their performance. In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and a synthetic data generator that address the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark and enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business perspective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical perspective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques. We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200-gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.
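The scale-factor-driven generation of structured and unstructured data can be illustrated with a toy generator. Table names, row proportions, and review texts below are invented for illustration and bear no relation to the real BigBench generator.

```python
import random

def generate(scale_factor, seed=42):
    """Toy synthetic generator: one scale factor drives the volume of
    structured rows (an item table) and unstructured text (reviews)."""
    rng = random.Random(seed)  # seeded for reproducible data sets
    n_items = 100 * scale_factor      # structured part
    n_reviews = 300 * scale_factor    # unstructured part
    items = [{"item_id": i, "price": round(rng.uniform(1, 500), 2)}
             for i in range(n_items)]
    reviews = [{"item_id": rng.randrange(n_items),
                "text": rng.choice(["great product",
                                    "broke after a week",
                                    "works as described"])}
               for _ in range(n_reviews)]
    return items, reviews
```

Doubling the scale factor doubles every table, mirroring how benchmark generators let a single knob control total data volume.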
Citations: 373
Trinity: a distributed graph engine on a memory cloud
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2467799
Bin Shao, Haixun Wang, Yatao Li
Computations performed by graph algorithms are data-driven and require a high degree of random data access. Despite the great progress made in disk technology, disks still cannot provide the level of efficient random access that graph computation requires. On the other hand, memory-based approaches usually do not scale because of the capacity limit of single machines. In this paper, we introduce Trinity, a general-purpose graph engine over a distributed memory cloud. Through optimized memory management and network communication, Trinity supports fast graph exploration as well as efficient parallel computing. In particular, Trinity leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance. This enables Trinity to support efficient online query processing and offline analytics on large graphs with just a few commodity machines. Furthermore, Trinity provides a high-level specification language called TSL that lets users declare data schemas and communication protocols, which brings great ease of use for general-purpose graph management and computing. Our experiments demonstrate Trinity's performance on both low-latency graph queries and high-throughput graph analytics on web-scale, billion-node graphs.
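A toy picture of the memory-cloud idea, hashing node IDs to machines and exploring adjacency lists across partitions, is sketched below; this is an illustrative assumption, not Trinity's implementation.

```python
class MemoryCloud:
    """Toy 'memory cloud': each machine holds an in-memory partition of the
    adjacency lists, and a node's partition is chosen by hashing its ID."""

    def __init__(self, machines=4):
        self.parts = [dict() for _ in range(machines)]

    def _part(self, node):
        return self.parts[hash(node) % len(self.parts)]

    def add_edge(self, u, v):
        self._part(u).setdefault(u, []).append(v)

    def neighbors(self, u):
        return self._part(u).get(u, [])

def explore(cloud, start, hops):
    """Breadth-first exploration up to `hops` hops from `start`."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {v for u in frontier for v in cloud.neighbors(u)} - seen
        seen |= frontier
    return seen
```

In the real system each hop may cross machines, so the engine's performance hinges on exactly the memory-management and network optimizations the abstract describes.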
Citations: 461
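The core idea of the Trinity abstract, random access into a distributed in-memory store instead of disk, can be illustrated with a minimal sketch. Everything here is hypothetical: `MemoryCloud` and `bfs_explore` are illustrative names, not Trinity's actual API, and real Trinity cells hold binary data declared in TSL and are partitioned across machines rather than across Python dicts.

```python
# Sketch: graph exploration over a hash-partitioned in-memory store,
# in the spirit of a "memory cloud". Illustrative only, not Trinity's API.
from collections import deque

class MemoryCloud:
    """Simulates a distributed key-value store: vertex cells are
    hash-partitioned across a fixed number of in-memory partitions."""
    def __init__(self, num_machines=4):
        self.partitions = [dict() for _ in range(num_machines)]

    def _partition(self, vertex_id):
        # A real system would route this lookup over the network.
        return self.partitions[hash(vertex_id) % len(self.partitions)]

    def put(self, vertex_id, neighbors):
        self._partition(vertex_id)[vertex_id] = list(neighbors)

    def get(self, vertex_id):
        return self._partition(vertex_id).get(vertex_id, [])

def bfs_explore(cloud, start):
    """Graph exploration: each hop is a random access into the
    memory cloud rather than a disk seek."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in cloud.get(v):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

cloud = MemoryCloud()
cloud.put("a", ["b", "c"])
cloud.put("b", ["d"])
cloud.put("c", [])
cloud.put("d", [])
print(bfs_explore(cloud, "a"))  # ['a', 'b', 'c', 'd']
```

The point of the sketch is the access pattern: BFS issues many small, unpredictable lookups, which is cheap against RAM-resident partitions but ruinous against disk.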
String similarity measures and joins with synonyms
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465313
Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang
A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings considers only syntactic similarities, e.g., the number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of this paper is to exploit such existing knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because using synonyms in similarity measures, while expressive, is computationally expensive (NP-hard), we propose an efficient algorithm, called selective expansion, which guarantees optimality in many real scenarios. We then study a novel indexing structure called SI-tree, which combines signature and length filtering strategies for efficient string similarity joins with synonyms. We develop an estimator to approximate the number of candidates, enabling an online selection of signature filters that further improves efficiency. The estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time, making our method attractive both in theory and in practice. Finally, the results of an empirical study of the algorithms verify the effectiveness and efficiency of our approach.
Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang. "String similarity measures and joins with synonyms." Proceedings. ACM-SIGMOD International Conference on Management of Data, pp. 373-384, 2013-06-22. DOI: 10.1145/2463676.2465313.
Citations: 67
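The expansion-based idea can be sketched naively: enumerate every synonym rewriting of both strings and take the best Jaccard score over the resulting token sets. The paper's selective-expansion algorithm exists precisely to prune this exponential enumeration; the brute-force version below is only feasible for tiny inputs, and all names (`expansions`, `synonym_similarity`) are illustrative, not the paper's.

```python
# Naive full-expansion sketch of synonym-aware string similarity.
# Brute force over all rule applications; fine only for tiny inputs.
from itertools import product

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def expansions(tokens, synonyms):
    """Yield every token list reachable by independently rewriting each
    token with one of its synonyms (or leaving it unchanged)."""
    options = [[t] + synonyms.get(t, []) for t in tokens]
    for choice in product(*options):
        yield list(choice)

def synonym_similarity(s1, s2, synonyms):
    """Maximum Jaccard similarity over all expansions of both strings."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    return max(jaccard(e1, e2)
               for e1 in expansions(t1, synonyms)
               for e2 in expansions(t2, synonyms))

synonyms = {"bill": ["william"], "sam": ["samuel"]}
print(synonym_similarity("Bill Gates", "William Gates", synonyms))  # 1.0
```

A plain token-Jaccard measure would score "Bill Gates" vs. "William Gates" at 1/3; applying the synonym rule lifts it to 1.0, which is exactly the gap the paper targets.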
Online search of overlapping communities
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463722
Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, Wei Wang
A great deal of research has been conducted on modeling and discovering communities in complex networks. In most real-life networks, an object often participates in multiple overlapping communities. In view of this, recent research has focused on mining overlapping communities in complex networks. These algorithms essentially materialize a snapshot of the overlapping communities in the network. This approach has three drawbacks, however. First, the mining algorithm uses the same global criterion to decide whether a subgraph qualifies as a community; in other words, the criterion is fixed and predetermined, whereas in reality communities for different vertices may have very different characteristics. Second, it is costly, time consuming, and often unnecessary to find communities for an entire network. Third, the approach does not support dynamically evolving networks. In this paper, we focus on online search of overlapping communities: given a query vertex, we find, in an online manner, meaningful overlapping communities that the vertex belongs to. In doing so, each search can use a community criterion tailored to the vertex in question. To support this approach, we introduce a novel model of overlapping communities and provide theoretical guidelines for tuning the model. We present several algorithms for online overlapping community search and conduct comprehensive experiments that demonstrate the effectiveness of the model and the algorithms. We also suggest many potential applications of our model and algorithms.
Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, Wei Wang. "Online search of overlapping communities." Proceedings. ACM-SIGMOD International Conference on Management of Data, pp. 277-288, 2013-06-22. DOI: 10.1145/2463676.2463722.
Citations: 179
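The query-centric flavor of online community search can be sketched with a simple greedy heuristic: grow a community outward from the query vertex, repeatedly adding the boundary vertex that most improves internal edge density, and stop when no addition helps. This density criterion is a stand-in for illustration, not the paper's community model, and the function names are hypothetical.

```python
# Sketch of online, query-centric community search via greedy growth.
# The density objective is illustrative, not the paper's model.

def density(graph, comm):
    """Fraction of edges incident to `comm` that stay inside it.
    Internal edges are counted from both endpoints, hence the /2."""
    internal = sum(1 for v in comm for w in graph[v] if w in comm) / 2
    external = sum(1 for v in comm for w in graph[v] if w not in comm)
    return internal / (internal + external) if internal + external else 0.0

def local_community(graph, query):
    """Grow a community around `query`, one boundary vertex at a time,
    stopping when no candidate strictly improves the density."""
    comm = {query}
    while True:
        boundary = {w for v in comm for w in graph[v]} - comm
        best, best_d = None, density(graph, comm)
        for w in boundary:
            d = density(graph, comm | {w})
            if d > best_d:
                best, best_d = w, d
        if best is None:
            return comm
        comm.add(best)

# Two triangles joined by a bridge a-x: the search stays on the
# query's side of the bridge.
graph = {
    "a": ["b", "c", "x"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "x": ["a", "y", "z"],
    "y": ["x", "z"],
    "z": ["x", "y"],
}
print(sorted(local_community(graph, "a")))  # ['a', 'b', 'c']
```

Note how the result depends on the query vertex: starting from "y" yields the other triangle. This is the advantage the abstract argues for, community criteria evaluated per query rather than one global criterion materialized for the whole network.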