Proceedings of the 2017 Symposium on Cloud Computing: latest papers, page 5

Efficient and consistent replication for distributed logs

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132695

Hua Fan, Jeffrey Pound, P. Bumbulis, Nathan Auch, Scott MacLean, Eric Garber, Anil K. Goel

Distributed shared logs are a powerful building block for distributed systems. Because a distributed shared log provides fault-tolerant persistence and strong ordering guarantees, applications can use it to reliably communicate a stream of events between processes, for example to replicate application state or to build a reliable publish/subscribe system. The log itself must also replicate data in order to provide availability and fault tolerance. Key to the design of a distributed shared log is the choice of replication algorithm, which determines many properties of the system. We propose an algorithm for consistent replication of log data, quorum-replication with meta-data exchange (QMX), that is linearizable while allowing writes to succeed with only a single round trip to a quorum of replicas and allowing reads to generally be serviced by any single replica (read-one/write-quorum). This is achieved by coupling the reads with an asynchronous message exchange algorithm that runs continuously among the replicas. The message exchange algorithm allows replicas to infer the global state of writes across the cluster, in order to deduce which writes have been successfully quorum replicated and which have not. This metadata allows any single replica to answer reads directly in many cases, though in the worst case a read must wait for the message-passing round to complete before being serviced, which requires a majority quorum of servers to be responsive.
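As a rough illustration of the read-one/write-quorum idea in this abstract, the Python sketch below models replicas that exchange metadata about which log positions they hold, so that a single replica can tell whether a write reached a majority. It is a toy model and not the QMX protocol: the names `Replica`, `exchange_metadata`, and `can_serve_read` are hypothetical, and it assumes a fixed replica set, a simple majority quorum, and a synchronous exchange step.

```python
from dataclasses import dataclass, field


def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


@dataclass
class Replica:
    name: str
    cluster_size: int
    # Log positions this replica has stored locally.
    stored: set = field(default_factory=set)
    # What this replica believes each peer has stored (learned via exchange).
    known: dict = field(default_factory=dict)

    def write(self, position: int) -> None:
        self.stored.add(position)

    def learn(self, peer: "Replica") -> None:
        # One step of the assumed metadata exchange: copy the peer's view.
        self.known[peer.name] = set(peer.stored)

    def holders(self, position: int) -> int:
        # Count replicas known to hold the position, including this one.
        count = 1 if position in self.stored else 0
        count += sum(1 for held in self.known.values() if position in held)
        return count

    def can_serve_read(self, position: int) -> bool:
        # Answer locally only when the write is known to be quorum replicated.
        enough = self.holders(position) >= majority(self.cluster_size)
        return position in self.stored and enough


def exchange_metadata(replicas: list) -> None:
    """Hypothetical exchange round: every replica learns every peer's state."""
    for replica in replicas:
        for peer in replicas:
            if peer is not replica:
                replica.learn(peer)


replicas = [Replica(name, 3) for name in ("a", "b", "c")]
a, b, c = replicas
a.write(0)
b.write(0)  # position 0 reached a majority (2 of 3)
a.write(1)  # position 1 reached only one replica
before = a.can_serve_read(0)  # metadata not yet exchanged
exchange_metadata(replicas)
after = a.can_serve_read(0)   # replica a now knows b also holds position 0
```

In this model a replica refuses to answer until the exchange has told it enough, which mirrors the worst case mentioned in the abstract where a read waits for a message-passing round.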

Citations: 0

QFrag: distributed graph search via subgraph isomorphism

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3131625

M. Serafini, G. D. F. Morales, Georgos Siganos

This paper introduces QFrag, a distributed system for graph search on top of bulk synchronous processing (BSP) systems such as MapReduce and Spark. Searching for patterns in graphs is an important and computationally complex problem. Most current distributed search systems scale to graphs that do not fit in main memory by partitioning the input graph. For analytical queries, however, this approach entails running expensive distributed joins on large intermediate data. In this paper we explore an alternative approach: replicating the input graph and running independent parallel instances of a sequential graph search algorithm. In principle, this approach leads to an embarrassingly parallel problem, since workers can complete their tasks in parallel without coordination. However, the skew present in natural graphs makes this problem a deceitfully parallel one, i.e., an embarrassingly parallel problem with poor load balancing. We therefore introduce a task fragmentation technique that avoids stragglers while minimizing coordination. Our evaluation shows that QFrag outperforms BSP-based systems by orders of magnitude, and performs similarly to asynchronous MPI-based systems on simple queries. Furthermore, it is able to run computationally complex analytical queries that other systems are unable to handle.
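To make the "deceitfully parallel" point concrete, the following Python sketch compares the finish time of the slowest worker when skewed tasks are scheduled whole versus when they are first split into bounded pieces. This is only an illustration under assumed task costs: the functions `makespan` and `fragment` are hypothetical, the splitting is static, and the sketch does not reproduce QFrag's actual fragmentation technique.

```python
import heapq


def makespan(costs, workers):
    """Greedy longest-first assignment; returns the finish time of the slowest worker."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for cost in sorted(costs, reverse=True):
        lightest = heapq.heappop(loads)  # worker with the least work so far
        heapq.heappush(loads, lightest + cost)
    return max(loads)


def fragment(costs, max_piece):
    """Split every task into pieces no larger than max_piece."""
    pieces = []
    for cost in costs:
        while cost > max_piece:
            pieces.append(max_piece)
            cost -= max_piece
        if cost > 0:
            pieces.append(cost)
    return pieces


# Assumed skewed workload: one task dominates, as a high-degree vertex might.
tasks = [100, 5, 5, 5, 5]
whole = makespan(tasks, workers=4)
split = makespan(fragment(tasks, max_piece=10), workers=4)
print(f"whole tasks: {whole}, fragmented tasks: {split}")
```

With whole tasks the single large task pins one worker while the others idle; after fragmentation the same total work spreads evenly across the four workers.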

Citations: 27

Resilient cloud in dynamic resource environments

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132571

Fan Yang, A. Chien, Haryadi S. Gunawi

Traditional cloud stacks are designed to tolerate random, small-scale failures, and can successfully deliver highly available cloud services and interactive services to end users. However, they fail to survive large-scale disruptions caused by major power outages, cyber-attacks, or region/zone failures. Such changes trigger cascading failures and significant service outages. We propose to understand the reasons for these failures, and to create reliable data services that can efficiently and robustly tolerate such large-scale resource changes. We believe cloud services will need to survive frequent, large dynamic resource changes in the future to be highly available. (1) Significant new challenges to cloud reliability are emerging, including cyber-attacks, power/network outages, and so on. For example, human error disrupted the Amazon S3 service on 02/28/17 [2]. Recently hackers have even been attacking electric utilities, which may lead to more outages [3, 6]. (2) Increased attention to resource cost optimization will increase usage dynamism, as with Amazon Spot Instances [1]. (3) Availability-focused cloud applications will increasingly practice continuous testing to ensure they have no hidden source of catastrophic failure. For example, Netflix Simian Army can simulate the outages of individual servers, and even of an entire AWS region [4]. (4) Cloud applications with dynamic flexibility will reap numerous benefits, such as flexible deployments and managing cost arbitrage and reliability arbitrage across cloud providers and datacenters. Using Apache Cassandra [5] as the model system, we characterize its failure behavior under dynamic datacenter-scale resource changes. Each datacenter is volatile and randomly shut down with a given duty factor. We simulate a read-only workload on a quorum-based system deployed across multiple datacenters, varying (1) system scale, (2) the fraction of volatile datacenters, and (3) the duty factor of volatile datacenters. We explore the space of various configurations, including replication factors and consistency levels, and measure the service availability (% of succeeded requests) and replication overhead (number of total replicas). Our results show that, in a volatile resource environment, the current replication and quorum protocols in Cassandra-like systems cannot achieve high availability and consistency with low replication overhead. Our contributions include: (1) Detailed characterization of failures under dynamic datacenter-scale resource changes, showing that the existing protocols in quorum-based systems cannot achieve high availability and consistency with low replication cost. (2) Study of the best achievable availability of data service in a dynamic datacenter-scale resource environment.
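The trade-off between availability and replication overhead can be sketched with a simple probability model. The Python code below is not the authors' simulator: it assumes every replica lives in its own volatile datacenter, that datacenters fail independently, and that the duty factor is the fraction of time a datacenter is up. The function names are hypothetical; ONE and QUORUM refer to the Cassandra consistency levels of the same names.

```python
from math import comb


def quorum(replication_factor: int) -> int:
    """Majority of the replicas, as used by QUORUM-style consistency levels."""
    return replication_factor // 2 + 1


def read_availability(replication_factor: int, required: int, duty_factor: float) -> float:
    """Probability that at least `required` replicas are reachable.

    Assumes independent datacenters, each up with probability `duty_factor`.
    """
    n, p = replication_factor, duty_factor
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(required, n + 1))


for rf in (3, 5, 7):
    one = read_availability(rf, 1, 0.5)
    majority_read = read_availability(rf, quorum(rf), 0.5)
    print(f"RF={rf}: ONE={one:.3f} QUORUM={majority_read:.3f}")
```

In this simplified model, with a duty factor of 0.5 and an odd replication factor, the majority-read availability stays at 0.5 however many replicas are added, while reads that need only one replica become more available at the cost of weaker consistency.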

Citations: 2

SLAQ: quality-driven scheduling for distributed machine learning

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3127490

Haoyu Zhang, Logan Stafman, Andrew Or, M. Freedman

Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. We describe SLAQ, a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. When allocating cluster resources, SLAQ explores the quality-runtime trade-offs across multiple jobs to maximize system-wide quality improvement. To do so, SLAQ leverages the iterative nature of ML training algorithms, by collecting quality and resource usage information from concurrent jobs, and then generating highly-tailored quality-improvement predictions for future iterations. Experiments show that SLAQ achieves an average quality improvement of up to 73% and an average delay reduction of up to 44% on a large set of ML training jobs, compared to resource fairness schedulers.
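The idea of directing resources to the jobs with the most potential for improvement can be sketched as a greedy allocation over predicted quality gains. The Python code below is a minimal sketch under stated assumptions and not SLAQ's scheduler: the `allocate` function, the job names, and the diminishing-returns prediction curves are all hypothetical.

```python
import heapq


def allocate(predicted_gain, total_units):
    """Give resource units one at a time to the job with the largest predicted marginal gain.

    predicted_gain maps a job name to a function: units -> predicted quality improvement.
    """
    allocation = {job: 0 for job in predicted_gain}
    # Max-heap on the marginal gain of the next unit (negated for heapq's min-heap).
    heap = [(-(f(1) - f(0)), job) for job, f in predicted_gain.items()]
    heapq.heapify(heap)
    for _ in range(total_units):
        _, job = heapq.heappop(heap)
        allocation[job] += 1
        units = allocation[job]
        f = predicted_gain[job]
        heapq.heappush(heap, (-(f(units + 1) - f(units)), job))
    return allocation


# Hypothetical prediction curves with diminishing returns: a job early in training
# (large remaining loss) versus one that has almost converged.
curves = {
    "new_job": lambda units: 1.0 * (1 - 0.5**units),
    "converged_job": lambda units: 0.1 * (1 - 0.5**units),
}
result = allocate(curves, total_units=6)
print(result)
```

A fair scheduler would give each job three units here; the greedy rule instead gives most of them to the job whose quality is still improving quickly, which is the behaviour the abstract contrasts with resource fairness.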

Citations: 120

RStore: efficient multiversion document management in the cloud

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132693

Souvik Bhattacherjee, A. Deshpande

Motivation. The iterative and exploratory nature of the data science process, combined with an increasing need to support debugging, historical queries, auditing, provenance, and reproducibility, warrants the need to store and query a large number of versions of a dataset. This realization has led to many efforts at building data management systems that support versioning as a first-class construct, both in academia [1, 3, 5, 6] and in industry (e.g., git, Datomic, noms). These systems typically support rich versioning/branching functionality and complex queries over versioned information but lack the capability to host versions of a collection of keyed records or documents in a distributed environment or a cloud. Alternatively, key-value stores (e.g., Apache Cassandra, HBase, MongoDB) are appealing in many collaborative scenarios spanning geographically distributed teams, since they offer centralized hosting of the data, are resilient to failures, can easily scale out, and can handle a large number of queries efficiently. However, those do not offer rich versioning and branching functionality akin to hosted version control systems (VCS) like GitHub. This work addresses the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over those.

RStore Overview. Our primary focus here is to provide versioning and branching support for collections of records with unique identifiers. Like popular NoSQL systems, RStore supports a flexible data model; records with varying sizes, ranging from a few bytes to a few MBs; and a variety of retrieval queries to cover a wide range of use cases. Specifically, similar to NoSQL systems, our system supports efficient retrieval of a specific record in a specific version (given a key and a version identifier), or the entire evolution history for a given key. Similar to VCS, it supports retrieving all records belonging to a specific version to support use cases that require updating a large number of records (e.g., by applying a data cleaning step). Finally, since retrieving an entire version might be unnecessary and expensive, our system supports partial version retrieval given a range of keys and a version identifier.

Challenges. Addressing the above desiderata poses many design and computational challenges, and natural baseline approaches (see full paper [2] for more details) that attempt to build this functionality on top of existing key-value stores suffer from critical limitations. First, most of those baseline approaches cannot directly support point queries targeting a specific record in a specific version (and by extension, full or partial version retrieval queries), without constructing and maintaining explicit indexes. Second, all the viable baselines fundamentally require too many back-and-forths between the retrieval module and the backend key-value store; this is because the needed set of records cannot be succinctly described as a query. Third, ingesting new versions is difficult for most of the baseline approaches. Finally, exploiting "record-level compression" is difficult or impossible in those approaches; this is crucial for handling the common use case where large records (e.g., documents) are updated frequently with relatively small changes.

Key Ideas. To address these challenges, RStore uses a novel architecture that partitions the distinct records into approximately equal-sized "chunks", with the goal of minimizing the number of chunks that need to be retrieved for a given query workload [2]. A few simple tuning knobs allow the system to adapt to different data and workload requirements. The key computational challenge boils down to deciding how to optimally partition the records into chunks; we draw connections to well-studied problems such as compressing bipartite graphs and hypergraph partitioning to show that the problem is NP-hard in general. Our system uses a novel algorithm that exploits the structure of the version graph to find an effective partitioning of the records, and is built on top of Apache Cassandra. An extensive experimental evaluation over a large number of synthetically constructed datasets shows the effectiveness of RStore and validates our design decisions.
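The query types listed in the abstract (point lookup by key and version, key history, and full or partial version retrieval) can be illustrated with a toy in-memory store that packs distinct records into fixed-capacity chunks. This Python sketch is an assumption-laden stand-in: `ChunkStore` and its methods are hypothetical, chunks are filled in arrival order rather than by the version-graph partitioning algorithm the abstract describes, and values must be hashable.

```python
from collections import defaultdict


class ChunkStore:
    """Toy store: distinct (key, value) records are packed into fixed-capacity chunks."""

    def __init__(self, chunk_capacity):
        self.chunk_capacity = chunk_capacity
        self.chunks = [[]]              # chunk id -> list of (key, value) records
        self.location = {}              # record -> (chunk id, offset), for deduplication
        self.index = defaultdict(dict)  # version -> key -> (chunk id, offset)

    def commit(self, version, records):
        """Store one version given as a dict key -> value; unchanged records are shared."""
        for key, value in records.items():
            record = (key, value)
            if record not in self.location:
                if len(self.chunks[-1]) >= self.chunk_capacity:
                    self.chunks.append([])
                self.chunks[-1].append(record)
                self.location[record] = (len(self.chunks) - 1, len(self.chunks[-1]) - 1)
            self.index[version][key] = self.location[record]

    def get(self, version, key):
        """Point query: one record in one version."""
        chunk_id, offset = self.index[version][key]
        return self.chunks[chunk_id][offset][1]

    def history(self, key):
        """Evolution of one key: version -> value, for every version that contains it."""
        return {v: self.get(v, key) for v, keys in self.index.items() if key in keys}

    def chunks_for_version(self, version, keys=None):
        """Chunk ids that must be fetched for a full (keys=None) or partial version."""
        entries = self.index[version]
        wanted = entries if keys is None else [k for k in keys if k in entries]
        return {entries[k][0] for k in wanted}


store = ChunkStore(chunk_capacity=2)
store.commit("v1", {"a": "x", "b": "y", "c": "z"})
store.commit("v2", {"a": "x", "b": "y2", "c": "z"})  # only record "b" changed
print(store.get("v2", "b"), store.history("b"), store.chunks_for_version("v2", ["a"]))
```

The number of chunks returned by `chunks_for_version` is the quantity a smarter partitioning would try to keep small for the expected query workload.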

Citations: 1

Trustable virtual machine scheduling in a cloud

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3128608

Fabien Hermenier, L. Henrio

In an Infrastructure as a Service (IaaS) cloud, the scheduler deploys VMs to servers according to service level objectives (SLOs). Clients and service providers must both trust the infrastructure. In particular, they must be sure that the VM scheduler takes decisions that are consistent with its advertised behaviour. However, the difficulty of mastering every theoretical and practical aspect of a VM scheduler implementation leads to faulty behaviours that break SLOs and reduce the provider's revenues. We present SafePlace, a specification and testing framework that exposes inconsistencies in VM schedulers. SafePlace combines a DSL for formalising scheduling decisions with fuzz testing to generate a large spectrum of test cases and automatically report implementation faults. We evaluate SafePlace on the VM scheduler BtrPlace. Without any code modification, SafePlace makes it possible to write test campaigns that are 3.83 times smaller than BtrPlace unit tests. SafePlace performs 200 tests per second, has revealed new non-trivial bugs, and outperforms the BtrPlace runtime assertion system.
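The combination of a formal specification with fuzz testing can be sketched in a few lines of Python. Nothing here uses SafePlace's DSL or the BtrPlace API: the toy scheduler `round_robin`, the specification `respects_capacity`, and the generator parameters are all hypothetical, chosen only to show how randomly generated instances can expose a scheduler that violates its advertised constraint.

```python
import random


def round_robin(vm_sizes, capacities):
    """Toy scheduler under test: assigns VM i to server i mod n and ignores capacity."""
    return [i % len(capacities) for i in range(len(vm_sizes))]


def respects_capacity(placement, vm_sizes, capacities):
    """Specification: the VMs placed on a server must fit within its capacity."""
    used = [0] * len(capacities)
    for vm, server in enumerate(placement):
        used[server] += vm_sizes[vm]
    return all(u <= c for u, c in zip(used, capacities))


def fuzz(scheduler, spec, cases=200, seed=0):
    """Generate random instances and collect those where the scheduler violates the spec."""
    rng = random.Random(seed)
    failures = []
    for _ in range(cases):
        capacities = [rng.randint(4, 8) for _ in range(3)]
        vm_sizes = [rng.randint(1, 8) for _ in range(rng.randint(1, 6))]
        placement = scheduler(vm_sizes, capacities)
        if not spec(placement, vm_sizes, capacities):
            failures.append((vm_sizes, capacities, placement))
    return failures


failures = fuzz(round_robin, respects_capacity)
print(f"{len(failures)} of 200 generated cases violate the capacity specification")
```

Each reported failure is a concrete instance (VM sizes, server capacities, and the offending placement) that a developer could replay against the scheduler.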

Citations: 3

GLoop: an event-driven runtime for consolidating GPGPU applications

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132023

Yusuke Suzuki, H. Yamada, S. Kato, K. Kono

Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making GPUs a time-multiplexed resource is key to consolidating GPGPU applications (apps) in multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation. Such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, a software runtime that makes it possible to consolidate GPGPU apps including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while proportionally scheduling them on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters to it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps on the basis of its scheduling policy and isolates resources among them.

Citations: 7

ALOHA-KV: high performance read-only and write-only distributed transactions

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3127487

Hua Fan, W. Golab, C. B. Morrey

There is a trend in recent database research to pursue coordination avoidance and weaker transaction isolation under a long-standing assumption: concurrent serializable transactions under read-write or write-write conflicts require costly synchronization, and thus may incur a steep price in terms of performance. In particular, distributed transactions, which access multiple data items atomically, are considered inherently costly. They require concurrency control for transaction isolation since both read-write and write-write conflicts are possible, and they rely on distributed commitment protocols to ensure atomicity in the presence of failures. This paper presents serializable read-only and write-only distributed transactions as a counterexample to show that concurrent transactions can be processed in parallel with low overhead despite conflicts. Inspired by the slotted ALOHA network protocol, we propose a simpler and leaner protocol for serializable read-only and write-only transactions, which uses only one round trip to commit a transaction in the absence of failures irrespective of contention. Our design is centered around an epoch-based concurrency control (ECC) mechanism that minimizes synchronization conflicts and uses a small number of additional messages whose cost is amortized across many transactions. We integrate this protocol into ALOHA-KV, a scalable distributed key-value store for read-only and write-only transactions, and demonstrate that the system can process close to 15 million read/write operations per second per server when each transaction batches together thousands of such operations.

Citations: 7

Disaggregated operating system

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3131617

Yizhou Shan, Sumukh Hallymysore, Yutong Huang, Yilun Chen, Yiying Zhang

Recently, there has been an emerging trend toward a disaggregated hardware architecture that breaks monolithic servers into independent hardware components connected by a fast, scalable network [1, 2]. The disaggregated architecture offers several benefits over the traditional monolithic server model, including better resource utilization, easier hardware deployment, and support for heterogeneity. Our vision of the future disaggregated architecture is that each component will have its own controller to manage its hardware and will communicate with other components through a fast network.


Citations: 2

Preserving I/O prioritization in virtualized OSes

Proceedings of the 2017 Symposium on Cloud Computing

Pub Date : 2017-09-24 DOI: 10.1145/3127479.3127484

Kun Suo, Yong Zhao, J. Rao, Luwei Cheng, Xiaobo Zhou, F. Lau

While virtualization helps enable multi-tenancy in data centers, it introduces new challenges for resource management in traditional OSes. We find that one important OS design, the prioritization of interactive and I/O-bound workloads, can become ineffective in a virtualized OS. Resource multiplexing among multiple tenants breaks the assumption of continuous CPU availability that holds in physical systems and causes two types of priority inversion in virtualized OSes. In this paper, we present xBalloon, a lightweight approach to preserving I/O prioritization. It uses a balloon process in the virtualized OS to avoid priority inversion in both short-term and long-term scheduling. Experiments in a local Xen environment and on Amazon EC2 show that xBalloon improves I/O performance in a recent Linux kernel by as much as 136% in network throughput, 95% in disk throughput, and 125x in network tail latency.


Citations: 12