Proceedings. ACM-SIGMOD International Conference on Management of Data: Latest Publications

RTP: robust tenant placement for elastic in-memory database clusters
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465302
J. Schaffner, Tim Januschowski, Mary H. Kercher, Tim Kraska, H. Plattner, M. Franklin, D. Jacobs
In the cloud services industry, a key issue for cloud operators is to minimize operational costs. In this paper, we consider algorithms that elastically contract and expand a cluster of in-memory databases depending on tenants' behavior over time while maintaining response time guarantees. We evaluate our tenant placement algorithms using traces obtained from one of SAP's production on-demand applications. Our experiments reveal that our approach lowers operating costs for the database cluster of this application by a factor of 2.2 to 10, measured in Amazon EC2 hourly rates, in comparison to the state of the art. In addition, we carefully study the trade-off between cost savings obtained by continuously migrating tenants and the robustness of servers towards load spikes and failures.
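The abstract does not spell out the placement algorithm itself. As a rough illustration of the underlying packing problem, the sketch below assigns tenants with estimated loads to as few servers as possible using plain first-fit-decreasing bin packing; all names and numbers are made up, and the paper's robust algorithms additionally handle response time guarantees, replication, and migration, which this sketch ignores.

def place_tenants(tenant_load, server_capacity):
    """tenant_load: dict tenant -> estimated load fraction.
    Returns server index -> list of tenants (first-fit decreasing)."""
    servers = []  # list of [remaining_capacity, [tenants]]
    for tenant, load in sorted(tenant_load.items(), key=lambda kv: -kv[1]):
        for slot in servers:
            if load <= slot[0]:
                slot[0] -= load
                slot[1].append(tenant)
                break
        else:
            servers.append([server_capacity - load, [tenant]])
    return {i: slot[1] for i, slot in enumerate(servers)}

# Operating cost (e.g., in EC2 hourly rates) is then proportional to the
# number of servers in use.
placement = place_tenants({"t1": 0.6, "t2": 0.5, "t3": 0.3, "t4": 0.2}, 1.0)
print(len(placement), placement)
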
Citations: 48
Inter-media hashing for large-scale retrieval from heterogeneous data sources
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465274
Jingkuan Song, Yang Yang, Yi Yang, Zi-Liang Huang, Heng Tao Shen
In this paper, we present a new multimedia retrieval paradigm to innovate large-scale search of heterogeneous multimedia data. It is able to return results of different media types from heterogeneous data sources, e.g., using a query image to retrieve relevant text documents or images from different data sources. This utilizes the widely available data from different sources and caters to users' demand for a result list that simultaneously contains multiple types of data, giving a comprehensive view of the query's results. To enable large-scale inter-media retrieval, we propose a novel inter-media hashing (IMH) model to explore the correlations among multiple media types from different data sources and tackle the scalability issue. To this end, multimedia data from heterogeneous data sources are transformed into a common Hamming space, in which fast search can be easily implemented by XOR and bit-count operations. Furthermore, we integrate a linear regression model to learn hashing functions so that the hash codes for new data points can be efficiently generated. Experiments conducted on real-world large-scale multimedia datasets demonstrate the superiority of our proposed method compared with state-of-the-art techniques.
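As a concrete illustration of the Hamming-space search step described above, the sketch below packs binary codes into integers and ranks candidates by Hamming distance using XOR plus a bit count. The hash function here is a random sign projection used only as a stand-in; IMH learns the projections (and the cross-media correlations) rather than drawing them at random.

import numpy as np

def random_hash(X, n_bits=32, seed=0):
    """Stand-in for learned linear hash functions: code(x) = sign(x W) packed into ints."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ W) > 0                                            # (n, n_bits) booleans
    weights = np.left_shift(np.uint64(1), np.arange(n_bits, dtype=np.uint64))
    return (bits.astype(np.uint64) * weights).sum(axis=1)         # packed uint64 codes

def hamming_search(query_code, db_codes, k=5):
    """Rank stored codes by Hamming distance to the query via XOR + popcount."""
    xor = np.bitwise_xor(db_codes, query_code)
    dists = np.array([bin(int(v)).count("1") for v in xor])
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Usage idea: hash image features and text features into the same Hamming space
# (with their own learned projections), then answer an image query with the
# nearest text documents purely by Hamming distance.
db = random_hash(np.random.randn(1000, 64))
idx, d = hamming_search(db[0], db, k=3)
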
Citations: 517
Iterative parallel data processing with stratosphere: an inside look
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463693
Stephan Ewen, Sebastian Schelter, K. Tzoumas, Daniel Warneke, V. Markl
Iterative algorithms occur in many domains of data analysis, such as machine learning or graph analysis. With increasing interest in running those algorithms on very large data sets, we see a need for new techniques to execute iterations in a massively parallel fashion. In prior work, we have shown how to extend and use a parallel data flow system to efficiently run iterative algorithms in a shared-nothing environment. Our approach supports the incremental processing nature of many of those algorithms. In this demonstration proposal we illustrate the process of implementing, compiling, optimizing, and executing iterative algorithms on Stratosphere using examples from graph analysis and machine learning. For the first step, we show the algorithm's code and a visualization of the produced data flow programs. The second step shows the optimizer's execution plan choices, while the last phase monitors the execution of the program, visualizing the state of the operators and additional metrics, such as per-iteration runtime and number of updates. To show that the data flow abstraction supports easy creation of custom programming APIs, we also present programs written against a lightweight Pregel API that is layered on top of our system with a small programming effort.
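To make the vertex-centric ("Pregel-style") iteration model mentioned above concrete, here is a minimal single-machine sketch of label propagation for connected components, where each pass over the vertices plays the role of one superstep and the loop stops once no label changes. This is illustrative Python only, not the Stratosphere API or its Pregel layer; the graph and function names are invented.

def connected_components(edges, vertices):
    """Each vertex repeatedly adopts the smallest label among itself and its
    neighbors until nothing changes (one pass ~ one superstep)."""
    label = {v: v for v in vertices}
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    changed, supersteps = True, 0
    while changed:
        changed = False
        supersteps += 1
        for v in vertices:
            best = min([label[v]] + [label[n] for n in neighbors[v]])
            if best < label[v]:
                label[v] = best
                changed = True
    return label, supersteps

labels, steps = connected_components([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5])
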
Citations: 37
A demonstration of SQLVM: performance isolation in multi-tenant relational database-as-a-service
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463686
Vivek R. Narasayya, Sudipto Das, M. Syamala, B. Chandramouli, S. Chaudhuri
Sharing resources of a single database server among multiple tenants is common in multi-tenant Database-as-a-Service providers, such as Microsoft SQL Azure. Multi-tenancy enables cost reduction for the cloud service provider which it can pass on as savings to the tenants. However, resource sharing can adversely affect a tenant's performance due to other tenants' workloads contending for shared resources. Service providers today do not provide any assurances to a tenant in terms of isolating its performance from other co-located tenants. SQLVM, a project at Microsoft Research, is an abstraction for performance isolation which is built on a promise of reserving key database server resources, such as CPU, I/O and memory, for each tenant. The key challenge is in supporting this abstraction within a RDBMS without statically allocating resources to tenants, while ensuring low overheads and scaling to large numbers of tenants. This demonstration will show how SQLVM can effectively isolate a tenant's performance from other tenant workloads co-located at the same database server. Our demonstration will use various scripted scenarios and a data collection and visualization framework to illustrate performance isolation using SQLVM.
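The abstract describes reserving key resources such as I/O per tenant without statically allocating them. As a loose illustration of that reservation idea only (not SQLVM's actual metering or accounting mechanism), the sketch below uses a token bucket: a tenant accrues I/O tokens at its reserved rate and is throttled once the bucket is empty. Class and parameter names are invented for this sketch.

import time

class IOReservation:
    def __init__(self, iops_reserved, burst=None):
        self.rate = iops_reserved                # promised I/Os per second
        self.capacity = burst or iops_reserved   # maximum burst size
        self.tokens = self.capacity
        self.last = time.monotonic()

    def try_io(self):
        """Return True if the tenant may issue one I/O now, False if throttled."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tenant_a = IOReservation(iops_reserved=100)   # reserve 100 IOPS for tenant A
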
Citations: 74
PARAS: interactive parameter space exploration for association rule mining
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465245
Abhishek Mukherji, Xika Lin, Christopher R. Botaish, Jason Whitehouse, Elke A. Rundensteiner, M. Ward, Carolina Ruiz
We demonstrate our PARAS technology for supporting interactive association mining at near real-time speeds. Key technical innovations of PARAS, in particular, stable region abstractions and rule redundancy management supporting novel parameter space-centric exploratory queries will be showcased. The audience will be able to interactively explore the parameter space view of rules. They will experience near real-time speeds achieved by PARAS for operations, such as comparing rule sets mined using different parameter values, that would otherwise take hours of computation and much manual investigation. Overall, we will demonstrate that the PARAS system provides a rich experience to data analysts through parameter tuning recommendations while significantly reducing the trial-and-error interactions.
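The parameter-space idea can be illustrated with a toy example: once each rule is stored with its exact support and confidence, a (minsup, minconf) setting becomes a point query over that store, so rule sets for different settings can be retrieved and compared without re-mining. The rules and thresholds below are made up, and PARAS's stable-region abstraction and redundancy handling are not modeled.

rules = [
    # (antecedent, consequent, support, confidence) -- toy values only
    (("bread",), ("butter",), 0.30, 0.75),
    (("bread", "milk"), ("butter",), 0.12, 0.80),
    (("beer",), ("chips",), 0.05, 0.60),
]

def rules_at(minsup, minconf):
    """Point query in the (support, confidence) parameter space."""
    return [r for r in rules if r[2] >= minsup and r[3] >= minconf]

# Rule sets mined at two parameter settings can be diffed the same way.
print(rules_at(0.10, 0.70))
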
Citations: 7
Petabyte scale databases and storage systems at Facebook
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2463713
Dhruba Borthakur
At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data store systems have a common set of requirements: they have to be highly scalable, maintenance costs should be low and they have to perform efficiently. We use a sharded mySQL+memcache solution to support real-time access of tens of petabytes of data and we use TAO to provide consistency of this web-scale database across geographical distances. We use Haystack data store for storing the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages. This paper describes the reasons why each of these databases is appropriate for that workload and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability and partitioning tolerance of each of these solutions. We touch upon the reasons why some of these systems need ACID semantics and other systems do not. We describe the techniques we have used to map the Facebook Graph Database into a set of relational tables. We speak of how we plan to do big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD based transactional database. Esteemed researchers in the Database Management community have benchmarked query latencies on Hive/Hadoop to be less performant than a traditional Parallel DBMS. We describe why these benchmarks are insufficient for Big Data deployments and why we continue to use Hadoop/Hive. We present an alternate set of benchmark techniques that measure the capacity of a database, the value/byte in that database and the efficiency of inbuilt crowd-sourcing techniques to reduce administration costs of that database.
Citations: 24
CARTILAGE: adding flexibility to the Hadoop skeleton
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465258
Alekh Jindal, Jorge-Arnulfo Quiané-Ruiz, S. Madden
Modern enterprises have to deal with a variety of analytical queries over very large datasets. In this respect, Hadoop has gained much popularity since it scales to thousands of nodes and terabytes of data. However, Hadoop suffers from poor performance, especially I/O performance. Several works have proposed alternate data storage for Hadoop in order to improve query performance. However, many of these works end up making deep changes in Hadoop or HDFS. As a result, they are (i) difficult for many users to adopt, and (ii) not compatible with future Hadoop releases. In this paper, we present CARTILAGE, a comprehensive data storage framework built on top of HDFS. CARTILAGE allows users full control over their data storage, including data partitioning, data replication, data layouts, and data placement. Furthermore, CARTILAGE can be layered on top of an existing HDFS installation. This means that Hadoop, as well as other query engines, can readily make use of CARTILAGE. We describe several use-cases of CARTILAGE and propose to demonstrate the flexibility and efficiency of CARTILAGE through a set of novel scenarios.
Citations: 13
Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465308
Aaron J. Elmore, Sudipto Das, A. Pucher, D. Agrawal, A. E. Abbadi, Xifeng Yan
A multitenant database management system (DBMS) in the cloud must continuously monitor the trade-off between efficient resource sharing among multiple application databases (tenants) and their performance. Considering the scale of hundreds to thousands of tenants in such multitenant DBMSs, manual approaches for continuous monitoring are not tenable. A self-managing controller of a multitenant DBMS faces several challenges. For instance, how to characterize a tenant given its variety of workloads, how to reduce the impact of tenant colocation, and how to detect and mitigate a performance crisis where one or more tenants' desired service level objective (SLO) is not achieved. We present Delphi, a self-managing system controller for a multitenant DBMS, and Pythia, a technique to learn behavior through observation and supervision using DBMS-agnostic database level performance measures. Pythia accurately learns tenant behavior even when multiple tenants share a database process, learns good and bad tenant consolidation plans (or packings), and maintains a per-tenant history to detect behavior changes. Delphi detects performance crises, and leverages Pythia to suggest remedial actions using a hill-climbing search algorithm to identify a new tenant placement strategy to mitigate violating SLOs. Our evaluation using a variety of tenant types and workloads shows that Pythia can learn a tenant's behavior with more than 92% accuracy and learn the quality of packings with more than 86% accuracy. During a performance crisis, Delphi is able to reduce 99th percentile latencies by 80%, and can consolidate 45% more tenants than a greedy baseline, which balances tenant load without modeling tenant behavior.
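As a rough sketch of the hill-climbing placement search mentioned above, the function below repeatedly tries moving a single tenant to a different server and keeps any move that lowers a caller-supplied cost (for example, predicted SLO violations from a learned model such as Pythia's). The move set and the cost function are placeholders, not Delphi's actual ones.

def hill_climb(placement, servers, cost, max_steps=100):
    """placement: dict tenant -> server; cost(placement) -> float to minimize."""
    best, best_cost = dict(placement), cost(placement)
    for _ in range(max_steps):
        improved = False
        for tenant in list(best):
            for server in servers:
                if server == best[tenant]:
                    continue
                candidate = dict(best)
                candidate[tenant] = server        # try a single-tenant move
                c = cost(candidate)
                if c < best_cost:
                    best, best_cost, improved = candidate, c, True
        if not improved:
            break                                  # local optimum reached
    return best, best_cost
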
Citations: 50
Mobile interaction and query optimization in a protein-ligand data analysis system
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465344
Marvin Lapeine, K. Herbert, Emily Hill, N. Goodey
With current trends toward integrating phylogenetic analysis into pharma-research, computing systems that integrate the two areas can help the drug discovery field. DrugTree is a tool that overlays ligand data on a protein-motivated phylogenetic tree. While initial tests of DrugTree have been successful, a number of lags have been noticed when querying the tree. Due to the interleaving nature of the data, query optimization can become problematic, since the data is obtained from multiple sources, integrated, and then presented to the user with the ligand data imposed upon the phylogenetic analysis layer. This poster presents our initial methodologies for addressing the query optimization issues. Our approach applies standards as well as novel mechanisms to help improve performance.
Citations: 1
Timeline index: a unified data structure for processing queries on temporal data in SAP HANA
Pub Date : 2013-06-22 DOI: 10.1145/2463676.2465293
Martin Kaufmann, Amin Amiri Manjili, Panagiotis Vagenas, Peter M. Fischer, Donald Kossmann, Franz Färber, Norman May
Managing temporal data is becoming increasingly important for many applications. Several database systems already support the time dimension, but provide only a few temporal operators, which often exhibit poor performance characteristics. On the academic side, a large number of algorithms and data structures have been proposed, but they often address only a subset of these temporal operators. In this paper, we develop the Timeline Index as a novel, unified data structure that efficiently supports temporal operators such as temporal aggregation, time travel, and temporal joins. As the Timeline Index is independent of the physical order of the data, it provides flexibility in physical design; e.g., it supports any kind of compression scheme, which is crucial for main memory column stores. Our experiments show that the Timeline Index has predictable performance and beats state-of-the-art approaches significantly, sometimes by orders of magnitude.
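To give a feel for the time-travel operator, the sketch below keeps row activation and invalidation events ordered by time and reconstructs the rows visible at a given point by scanning events up to that time. This is a heavily simplified stand-in: the actual Timeline Index adds checkpoints for fast access and also drives temporal aggregation and temporal joins, none of which appear here.

import bisect

class SimpleTimeline:
    """Events are (time, row_id, kind): kind +1 means the row becomes visible,
    -1 means it is invalidated; the list is kept sorted by time."""
    def __init__(self):
        self.events = []

    def insert(self, time, row_id):
        bisect.insort(self.events, (time, row_id, +1))

    def invalidate(self, time, row_id):
        bisect.insort(self.events, (time, row_id, -1))

    def as_of(self, time):
        """Time travel: the set of row ids visible at 'time'."""
        visible = set()
        for t, row_id, kind in self.events:
            if t > time:
                break
            if kind > 0:
                visible.add(row_id)
            else:
                visible.discard(row_id)
        return visible

tl = SimpleTimeline()
tl.insert(1, "r1"); tl.insert(2, "r2"); tl.invalidate(3, "r1")
assert tl.as_of(2) == {"r1", "r2"} and tl.as_of(3) == {"r2"}
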
Citations: 91