
Proceedings of the 2017 Symposium on Cloud Computing: latest publications

Siphon: a high-performance substrate for inter-datacenter transfers in wide-area data analytics
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132561
Shuhao Liu, Li Chen, Baochun Li
Large volumes of raw data are naturally generated in multiple geographical regions. Modern data parallel frameworks may suffer degraded performance due to the much lower inter-datacenter WAN bandwidth. We present Siphon, a new high-performance substrate that improves the ability of modern data parallel frameworks to optimize WAN transfers. Siphon aggregates inter-datacenter flows and routes them with awareness of runtime network capacities. Following the principles of software-defined networking, Siphon is designed to accommodate a variety of flow scheduling algorithms that can improve job-level performance.
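The abstract does not spell out Siphon's scheduling algorithms, but the idea of routing aggregated flows with awareness of runtime network capacities can be illustrated with a minimal sketch (the one-hop relay model and all names here are illustrative assumptions, not Siphon's actual design):

```python
# Capacity-aware routing sketch: among candidate paths between two
# datacenters, pick the one whose bottleneck link has the most spare
# bandwidth at runtime.

def bottleneck(path, capacity):
    """A path's usable bandwidth is limited by its weakest link."""
    return min(capacity[link] for link in path)

def route_flow(src, dst, relays, capacity):
    """Consider the direct path and one-hop relay paths through other
    datacenters; return the path with the largest bottleneck capacity."""
    candidates = [[(src, dst)]]
    for r in relays:
        if r not in (src, dst):
            candidates.append([(src, r), (r, dst)])
    return max(candidates, key=lambda p: bottleneck(p, capacity))

# Runtime capacities (Gbps) observed between datacenter pairs.
capacity = {
    ("us", "eu"): 1.0,
    ("us", "asia"): 4.0,
    ("asia", "eu"): 5.0,
}
path = route_flow("us", "eu", ["asia"], capacity)
print(path)  # relaying via "asia" beats the congested direct link
```

Because capacities are observed at runtime, re-running the same selection as conditions change is what makes the routing adaptive rather than static.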
Citations: 3
DFS-container: achieving containerized block I/O for distributed file systems
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132568
Dan Huang, J. Wang, Qing Liu, Xuhong Zhang, Xunchao Chen, Jian Zhou
Today, Big Data systems commonly use resource management systems such as TORQUE, Mesos, and Google Borg to share physical resources among users or applications. Enabled by virtualization, users can run their applications on the same node with low mutual interference. Container-based virtualization (e.g., Docker and Linux Containers) offers a lightweight virtualization layer, which promises near-native performance and is adopted by some Big Data resource-sharing platforms such as Mesos. Nevertheless, using containers to consolidate the I/O resources of shared storage systems is still at an early stage, especially in a distributed file system (DFS) such as the Hadoop Distributed File System (HDFS). To overcome this issue, we propose a distributed middleware system, DFS-Container, which further containerizes the DFS. We also evaluate and analyze the unfairness of using containers to proportionally allocate the I/O resources of a DFS. Based on these analyses and evaluations, we propose and implement a new mechanism, IOPS-Regulator, which improves the fairness of proportional allocation by 74.4% on average.
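The abstract does not describe IOPS-Regulator's internals; as a hypothetical sketch of the underlying idea, proportional (weighted max-min) sharing of a node's IOPS budget across containers might look like this, with slack from under-demanding containers redistributed to the rest:

```python
# Weighted max-min allocation sketch (illustrative, not the paper's
# algorithm): split an IOPS budget across containers by weight; a
# container that demands less than its quota keeps only what it needs,
# and the leftover budget is re-split among the still-hungry containers.

def proportional_share(total_iops, weights, demands):
    alloc = {c: 0.0 for c in weights}
    active = set(weights)
    remaining = float(total_iops)
    while active and remaining > 1e-9:
        wsum = sum(weights[c] for c in active)
        satisfied = set()
        for c in active:
            quota = remaining * weights[c] / wsum
            if demands[c] - alloc[c] <= quota:
                alloc[c] = demands[c]   # fully satisfied below its quota
                satisfied.add(c)
        if not satisfied:               # everyone is throttled: split by weight
            for c in active:
                alloc[c] += remaining * weights[c] / wsum
            break
        remaining = total_iops - sum(alloc.values())
        active -= satisfied
    return alloc

# Container C has twice the weight of A and B; A only needs 100 IOPS.
alloc = proportional_share(1000,
                           weights={"A": 1, "B": 1, "C": 2},
                           demands={"A": 100, "B": 1000, "C": 1000})
print(alloc)  # A gets 100; B and C split the remaining 900 as 1:2
```

The work-conserving redistribution is what distinguishes this from a naive static split, which would strand 150 unused IOPS on container A.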
Citations: 3
Exploiting speculation in partially replicated transactional data stores
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132692
Zhongmiao Li, P. V. Roy, P. Romano
Online services are often deployed over geographically scattered data centers (geo-replication), which allows services to be highly available and reduces access latency. On the downside, to provide ACID transactions, global certification (i.e., across data centers) is needed to detect conflicts between concurrent transactions executing at different data centers. The global certification phase reduces throughput because transactions need to hold pre-commit locks, and it increases client-perceived latency because global certification lies on the critical path of transaction execution.
Citations: 1
No data left behind: real-time insights from a complex data ecosystem
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3131208
M. Karpathiotakis, A. Floratou, Fatma Özcan, A. Ailamaki
The typical enterprise data architecture consists of several actively updated data sources (e.g., NoSQL systems, data warehouses), and a central data lake such as HDFS, in which all the data is periodically loaded through ETL processes. To simplify query processing, state-of-the-art data analysis approaches solely operate on top of the local, historical data in the data lake, and ignore the fresh tail end of data that resides in the original remote sources. However, as many business operations depend on real-time analytics, this approach is no longer viable. The alternative is hand-crafting the analysis task to explicitly consider the characteristics of the various data sources and identify optimization opportunities, rendering the overall analysis non-declarative and convoluted. Based on our experiences operating in data lake environments, we design System-PV, a real-time analytics system that masks the complexity of dealing with multiple data sources while offering minimal response times. System-PV extends Spark with a sophisticated data virtualization module that supports multiple applications - from SQL queries to machine learning. The module features a location-aware compiler that considers source complexity, and a two-phase optimizer that produces and refines the query plans, not only for SQL queries but for all other types of analysis as well. The experiments show that System-PV is often faster than Spark by more than an order of magnitude. In addition, the experiments show that the approach of accessing both the historical and the remote fresh data is viable, as it performs comparably to solely operating on top of the local, historical data.
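The core "no data left behind" idea — answering a query over the union of the historical data in the lake and the fresh tail still in the remote source — can be sketched in a few lines (all names and the high-water-mark protocol here are illustrative assumptions, not System-PV's implementation):

```python
# Sketch: evaluate a predicate over local historical rows, then pull only
# the fresh tail (rows newer than the lake's ETL high-water mark) from
# the remote source, so stale remote data is never re-read.

def query_union(lake_rows, fetch_remote_tail, since, predicate):
    results = [r for r in lake_rows if predicate(r)]
    results += [r for r in fetch_remote_tail(since) if predicate(r)]
    return results

lake = [{"ts": 1, "v": 10}, {"ts": 2, "v": 30}]    # loaded by periodic ETL
remote = [{"ts": 2, "v": 30}, {"ts": 3, "v": 50}]  # live operational source

def fetch_tail(since):
    """Stand-in for a pushed-down remote scan of rows newer than `since`."""
    return [r for r in remote if r["ts"] > since]

rows = query_union(lake, fetch_tail, since=2, predicate=lambda r: r["v"] > 20)
print(rows)  # the fresh ts=3 row is included without rescanning old remote data
```

Filtering the tail at the source (pushing `since` and, in a real system, the predicate down to the remote scan) is what keeps the freshness guarantee from costing a full remote re-read.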
Citations: 9
Distributed shared persistent memory
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3128610
Yizhou Shan, Shin-Yeh Tsai, Yiying Zhang
Next-generation non-volatile memories (NVMs) will provide byte addressability, persistence, high density, and DRAM-like performance. They have the potential to benefit many datacenter applications. However, most previous research on NVMs has focused on using them in a single-machine environment. It is still unclear how to best utilize them in distributed, datacenter environments. We introduce Distributed Shared Persistent Memory (DSPM), a new framework for using persistent memories in distributed datacenter environments. DSPM provides a new abstraction that allows applications to both perform traditional memory load and store instructions and to name, share, and persist their data. We built Hotpot, a kernel-level DSPM system that provides low-latency, transparent memory accesses, data persistence, data reliability, and high availability. The key ideas of Hotpot are to integrate distributed memory caching and data replication techniques and to exploit application hints. We implemented Hotpot in the Linux kernel and demonstrated its benefits by building a distributed graph engine on Hotpot and porting a NoSQL database to Hotpot. Our evaluation shows that Hotpot outperforms a recent distributed shared memory system by 1.3× to 3.2× and a recent distributed PM-based file system by 1.5× to 3.0×.
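Hotpot itself is a kernel-level distributed system, but the programming model the abstract describes — named regions accessed with plain loads and stores, plus an explicit persist step — can be mimicked on a single machine with a memory-mapped file (everything below is an illustrative analogy, not Hotpot's API):

```python
import mmap
import os

# Single-machine analogy for the DSPM abstraction: a named, byte-
# addressable region where ordinary indexing plays the role of
# load/store instructions and an explicit commit makes stores durable.

class PersistentRegion:
    def __init__(self, name, size):
        fd = os.open(name, os.O_RDWR | os.O_CREAT, 0o644)
        os.ftruncate(fd, size)
        self.buf = mmap.mmap(fd, size)  # loads/stores are plain indexing
        os.close(fd)                    # the mapping outlives the fd

    def commit(self):
        self.buf.flush()  # force recent stores to stable storage

region = PersistentRegion("hotpot_demo.img", 4096)
region.buf[0:5] = b"hello"   # store through the mapping
region.commit()              # persist point
print(bytes(region.buf[0:5]))
```

What the analogy omits is exactly what Hotpot adds: naming and sharing across machines, replication for reliability, and distributed caching so remote loads stay fast.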
Citations: 105
Architectural implications on the performance and cost of graph analytics systems
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3128606
Qizhen Zhang, Hongzhi Chen, D. Yan, James Cheng, B. T. Loo, P. Bangalore
Graph analytics systems have gained significant popularity due to the prevalence of graph data. Many of these systems are designed to run in a shared-nothing architecture, whereby a cluster of machines can process a large graph in parallel. In more recent proposals, others have argued that a single-machine system can achieve better performance and/or is more cost-effective. There is, however, no clear consensus on which approach is better. In this paper, we classify existing graph analytics systems into four categories based on their architectural differences, i.e., processing infrastructure (centralized vs. distributed) and memory consumption (in-memory vs. out-of-core). We select eight open-source systems to cover all categories, and perform a comparative measurement study of their performance and cost characteristics across a spectrum of input data, applications, and hardware settings. Our results show that the best-performing configuration can depend on the type of application and input graph, and there is no dominant winner across all categories. Based on our findings, we summarize the trends in performance and cost, and provide several insights that help to illuminate the performance and resource cost tradeoffs across different graph analytics systems and categories.
Citations: 14
Prometheus: online estimation of optimal memory demands for workers in in-memory distributed computation
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132689
Guoyao Xu, Chengzhong Xu
Modern in-memory distributed computation frameworks like Spark leverage memory resources to cache intermediate data across multi-stage tasks in pre-allocated worker processes, so as to speed up execution. They rely on a cluster resource manager like Yarn or Mesos to pre-reserve a specific amount of CPU and memory for workers ahead of task scheduling. Since a worker is executed for an entire application and runs multiple batches of DAG tasks from multiple stages, its memory demands change over time [3]. Resource managers like Yarn address the non-trivial allocation problem of determining the right amount of memory to provision for workers by requiring users to make explicit reservations before execution. Since the underlying execution frameworks, workloads, and complex codebases are invisible to them, users tend to over-estimate or under-estimate workers' demands, leading to over-provisioning or under-provisioning of memory resources. We observed that there exists a performance inflection point with respect to the memory reserved per stage of an application: beyond it, performance fluctuates little even under over-provisioned memory [1]. This point is the minimum memory required to achieve the expected, nearly optimal performance. We call these capacities optimal demands; they are the capacity cut lines dividing over-provisioning from under-provisioning. To relieve the burden on users, and to provide guarantees of both maximum cluster memory utilization and optimal application performance, we present Prometheus, a system for the online estimation of the optimal memory demands of workers for each future stage, without involving users' efforts.

The procedure for exploring optimal demands is essentially a search problem correlating memory reservation and performance. Most existing search methods [2] need multiple profiling runs or prior historical execution statistics, which are not applicable to the online configuration of newly submitted or non-recurring jobs. The optimal demands of recurring applications also change over time with variations in input datasets, algorithmic parameters, or source code, so it becomes too expensive and infeasible to rebuild a new search model for every setting. Prometheus adopts a two-step approach to tackle the problem. 1) For newly submitted or non-recurring jobs, we perform profiling and histogram frequency analysis of the job's runtime memory footprints from only one pilot run under over-provisioned memory. This achieves a highly accurate (over 80% accuracy) initial estimation of the optimal demand per stage for each worker. By analyzing the frequency of past memory usage per sampling interval, we efficiently estimate the probability of base demands and distinguish them from unnecessarily excessive usage. Allocating the base demands tends to achieve near-optimal performance, thus approaching the optimal demands. 2) The histogram frequency analysis algorithm has an intrinsic property of self-decay. For subsequent recurring submissions, Prometheus exploits this property to efficiently perform a recursive search: through a small number of recurring executions, it progressively refines its estimates and quickly reaches the optimal demands. We show that this recursive search reduces search overhead by 3-4x and improves accuracy by 2-4x compared with alternative solutions such as random search. We validate the design by implementing Prometheus on top of Spark and Yarn. Experimental results show that the final estimation accuracy exceeds 92%. Deploying Prometheus and reserving memory according to the optimal demands improves cluster memory utilization by about 40%, while reducing the execution time of individual applications by more than 35% compared with the state of the art. Overall, the knowledge of optimal memory demands provided by Prometheus enables cluster managers to effectively avoid over-provisioning or under-provisioning of memory resources, achieving optimal application performance and maximum resource efficiency.
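The histogram-frequency step can be sketched as follows; the bucket width and the coverage threshold below are illustrative choices, not the paper's parameters:

```python
from collections import Counter

# Sketch: estimate a stage's "base" memory demand as the smallest
# reservation that covers most sampled footprints, treating rare high
# buckets as unnecessarily excessive usage rather than real demand.

def estimate_base_demand(samples_mb, bucket_mb=256, coverage=0.95):
    buckets = Counter(s // bucket_mb for s in samples_mb)
    needed = coverage * len(samples_mb)
    covered = 0
    for b in sorted(buckets):           # walk buckets from low to high
        covered += buckets[b]
        if covered >= needed:
            return (b + 1) * bucket_mb  # top edge of this bucket
    return (max(buckets) + 1) * bucket_mb

# 100 samples from one pilot run: mostly ~1 GB, plus a rare 7.8 GB spike.
samples = [900] * 60 + [1100] * 35 + [7800] * 5
print(estimate_base_demand(samples))  # 1280 (MB), not the 7.8 GB peak
```

Reserving the 95%-coverage bucket edge instead of the peak is what converts the observed footprint distribution into a near-optimal, non-over-provisioned reservation.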
Citations: 9
APUS: fast and scalable paxos on RDMA
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3128609
Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, Heming Cui
State machine replication (SMR) uses Paxos to enforce the same inputs for a program (e.g., Redis) replicated on a number of hosts, tolerating various types of failures. Unfortunately, traditional Paxos protocols incur prohibitive performance overhead on server programs due to their high consensus latency on TCP/IP. Worse, the consensus latency of extant Paxos protocols increases drastically when more concurrent client connections or hosts are added. This paper presents APUS, the first RDMA-based Paxos protocol that aims to be fast and scalable to client connections and hosts. APUS intercepts inbound socket calls of an unmodified server program, assigns a total order for all input requests, and uses fast RDMA primitives to replicate these requests concurrently. We evaluated APUS on nine widely-used server programs (e.g., Redis and MySQL). APUS incurred a mean overhead of 4.3% in response time and 4.2% in throughput. We integrated APUS with an SMR system Calvin. Our Calvin-APUS integration was 8.2X faster than the extant Calvin-ZooKeeper integration. The consensus latency of APUS outperformed an RDMA-based consensus protocol by 4.9X. APUS source code and raw results are released on github.com/hku-systems/apus.
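The total-ordering step that SMR requires — every replica applies the same inputs in the same order — can be shown in miniature (a sketch only: APUS does this over RDMA with Paxos-style consensus, whereas here a single leader and plain method calls stand in for both):

```python
import threading

# Minimal stand-in for input ordering in SMR: a leader tags every
# intercepted request with a global sequence number and hands it to all
# replicas, so their logs agree on one total order.

class Replica:
    def __init__(self):
        self.log = {}

    def accept(self, slot, request):
        self.log[slot] = request

    def committed(self):
        return [self.log[s] for s in sorted(self.log)]

class Leader:
    def __init__(self, replicas):
        self.seq = 0
        self.lock = threading.Lock()
        self.replicas = replicas

    def order(self, request):
        with self.lock:          # one total order across all client connections
            self.seq += 1
            slot = self.seq
        for r in self.replicas:  # stand-in for replication to peers
            r.accept(slot, request)
        return slot

replicas = [Replica(), Replica()]
leader = Leader(replicas)
for req in ["SET a 1", "SET b 2", "GET a"]:
    leader.order(req)
print(replicas[0].committed() == replicas[1].committed())  # True
```

The parts this sketch omits are exactly where APUS's contribution lies: surviving leader failure (the Paxos part) and keeping the per-request ordering latency low as connections and hosts scale (the RDMA part).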
Citations: 85
Abstract: cache management and load balancing for 5G cloud radio access networks
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3132690
Chin Tsai, M. Moh
Cloud Radio Access Networks (CRAN) have been proposed for 5G networks for better flexibility, scalability, and performance. CRAN applies a cloud-like architecture to the RAN. This paper focuses on cache management and load balancing in the software architecture of CRAN, and intends to provide fault mitigation and carrier-grade real-time services. First, a new cache management algorithm based on event frequency and the QoS level of the User Equipment (UE) is proposed, utilizing an Exponential Decay scoring function and the Analytic Hierarchy Process. Next, a load-balancing (LB) function is added, and five LB algorithms are evaluated. The performance evaluation is based on real-life UE event characteristics and RAN system values provided by Nokia Research. Results show cache hit rate, delay, and network traffic, as well as a queue-length analysis of the LB algorithms.
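To make the exponential-decay scoring idea concrete, here is a toy cache that decays each entry's frequency score over time and evicts the entry with the lowest decayed score. This is a sketch under stated assumptions — the decay rate, capacity, and single-score design are illustrative choices, not values or details from the paper (which additionally weighs UE QoS levels via the Analytic Hierarchy Process).

```python
import math

class DecayCache:
    """Toy cache ranked by an exponential-decay frequency score.
    Illustrative only: the paper's algorithm also factors in UE QoS
    levels; lam and capacity here are arbitrary assumptions."""

    def __init__(self, capacity, lam=0.5):
        self.capacity = capacity
        self.lam = lam          # decay rate per unit time
        self.store = {}         # key -> value
        self.meta = {}          # key -> (score, last_access_time)

    def _decayed(self, key, now):
        # Score decays exponentially with time since last access.
        s, t = self.meta[key]
        return s * math.exp(-self.lam * (now - t))

    def _bump(self, key, now):
        s = self._decayed(key, now) if key in self.meta else 0.0
        self.meta[key] = (s + 1.0, now)   # each access adds 1 to the score

    def get(self, key, now):
        if key in self.store:
            self._bump(key, now)
            return self.store[key]
        return None

    def put(self, key, value, now):
        if key not in self.store and len(self.store) >= self.capacity:
            # Evict the entry whose decayed score is lowest right now.
            victim = min(self.store, key=lambda k: self._decayed(k, now))
            del self.store[victim]
            del self.meta[victim]
        self.store[key] = value
        self._bump(key, now)
```

A frequently accessed key keeps a high score even as it decays, so a burst of recent hits protects an entry while stale entries drift toward eviction.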
Citations: 9
An experimental comparison of complex object implementations for big data systems
Pub Date : 2017-09-24 DOI: 10.1145/3127479.3129248
Sourav Sikdar, Kia Teymourian, C. Jermaine
Many cloud-based data management and analytics systems support complex objects. Dataflow platforms such as Spark and Flink allow programmers to manipulate sets of objects from a host programming language (often Java). Document databases such as MongoDB make use of hierarchical interchange formats---most popularly JSON---which embody a data model where individual records can themselves contain sets of records. Systems such as Dremel and AsterixDB allow complex nesting of data structures. Clearly, no system designer would expect a system that stores JSON objects as text to perform at the same level as a system based upon a custom-built physical data model. The question we ask is: how significant is the performance hit associated with choosing a particular physical implementation? Is the choice going to result in a negligible performance cost, or a debilitating one? Unfortunately, the literature contains no scientific study of how the physical implementation of complex object models affects system performance, so it is difficult for a system designer to fully understand the performance implications of such choices. This paper is an attempt to remedy that.
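The gap the abstract alludes to — objects stored as text versus a materialized in-memory representation — is easy to observe even at micro-benchmark scale. The sketch below, an illustration rather than the paper's methodology, scans the same records twice: once re-parsing each JSON string per access, once over already-materialized dicts.

```python
import json
import time

def sum_field_json_text(rows_as_text):
    # "Objects stored as text": every access must re-parse the record.
    return sum(json.loads(r)["amount"] for r in rows_as_text)

def sum_field_native(rows):
    # Custom in-memory layout: records are already materialized objects.
    return sum(r["amount"] for r in rows)

# Synthetic data; field names and sizes are illustrative assumptions.
rows = [{"id": i, "amount": i % 10} for i in range(100_000)]
text_rows = [json.dumps(r) for r in rows]

t0 = time.perf_counter()
total_text = sum_field_json_text(text_rows)
t1 = time.perf_counter()
total_native = sum_field_native(rows)
t2 = time.perf_counter()

assert total_text == total_native
print(f"text scan: {t1 - t0:.3f}s, native scan: {t2 - t1:.3f}s")
```

Both scans produce the same answer; only the physical representation differs, which is precisely the axis the paper's experiments vary.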
Citations: 5