
Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

On Max-min Fair Resource Allocation for Distributed Job Execution
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337843
Yitong Guan, Chuanyou Li, Xueyan Tang
In modern data-intensive computing, it is increasingly common for jobs to be executed in a distributed fashion across multiple machine clusters or datacenters to take advantage of data locality. This paper studies fair resource allocation among jobs requiring distributed execution. We extend conventional max-min fairness for resource allocation in a single machine or machine cluster to distributed job execution over multiple sites and define Aggregate Max-min Fairness (AMF), which requires the aggregate resource allocation across all sites to be max-min fair. We show that AMF satisfies the properties of Pareto efficiency, envy-freeness and strategy-proofness, but it does not necessarily satisfy the sharing incentive property. We propose an enhanced version of AMF to guarantee the sharing incentive property. We present algorithms to compute AMF allocations and propose an add-on to optimize job completion times under AMF. Experimental results show that, compared with a baseline which simply requires the resource allocation at each site to be max-min fair, AMF performs significantly better in balancing resource allocation and in job completion time, particularly when the workload distribution of jobs among sites is highly skewed.
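For orientation, the single-site max-min fairness that AMF extends is commonly computed by progressive filling: all unsatisfied jobs' allocations rise at the same rate, and a job is frozen once its demand is met. The sketch below shows only that classical building block; the paper's AMF algorithms, which enforce max-min fairness on aggregate allocations across sites, are more involved, and all names here are illustrative.

```python
def max_min_fair(capacity, demands):
    """Single-resource max-min fair allocation via progressive filling.

    capacity: divisible resource available at one site.
    demands:  {job_id: demanded amount}.
    Returns {job_id: allocation}: no job's share can be raised without
    lowering that of a job with an equal or smaller share.
    """
    alloc = {j: 0.0 for j in demands}
    active = set(demands)              # jobs whose demand is not yet met
    remaining = float(capacity)
    while active and remaining > 1e-12:
        share = remaining / len(active)
        satisfied = [j for j in active if demands[j] - alloc[j] <= share]
        if not satisfied:              # nobody saturates: split evenly, done
            for j in active:
                alloc[j] += share
            break
        j_min = min(satisfied, key=lambda j: demands[j] - alloc[j])
        grant = demands[j_min] - alloc[j_min]
        for j in active:               # raise everyone by the smallest gap
            alloc[j] += grant
            remaining -= grant
        active.discard(j_min)          # j_min's demand is now fully met
    return alloc

print(max_min_fair(10, {"A": 2, "B": 5, "C": 8}))
# {'A': 2.0, 'B': 4.0, 'C': 4.0}
```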
Citations: 2
Dynamic Load Balancing in Hybrid Switching Data Center Networks with Converters
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337898
Jiaqi Zheng, Qiming Zheng, Xiaofeng Gao, Guihai Chen
Today's data centers rely on scale-out architectures like fat-tree, BCube, VL2, etc. to connect a large number of commodity servers, and it is important to balance the traffic load across the available links. Since the traditional electrical network cannot perfectly respond to the traffic variations in data centers, a growing trend is to introduce converters with adjustable optical links instead of adding more wiring links. However, little is known today about how to fully exploit the flexibility the converters offer: jointly optimizing the optical links inside the converters and the routing in the whole network remains algorithmically challenging. In this paper, we initiate the study of the dynamic load balancing problem (DLBP) in hybrid switching data center networks with converters. We design a set of specific converters for the Diamond, VL2, and BCube topologies to introduce more flexibility. Building on these converters, the connections of the optical links inside each converter and the route for each flow need to be jointly optimized to minimize the maximum link utilization in the whole network. We formulate DLBP as an optimization program and prove that it is not only NP-hard but also ρ-inapproximable. We then design a greedy algorithm to solve it. Extensive experiments show that our algorithm reduces traffic congestion by 12% on average.
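The paper's greedy algorithm is not reproduced here, but its objective, keeping the network-wide maximum link utilization low while choosing among alternative paths, can be sketched as below. The data model (per-flow candidate paths over fixed links with additive loads) is a simplifying assumption for illustration.

```python
def greedy_route(flows, candidate_paths, capacity):
    """Route each flow on the candidate path that minimizes the resulting
    bottleneck (maximum) link utilization; flows are placed in decreasing
    demand order. Returns the routing and the final max link utilization.

    flows:           {flow_id: demand}
    candidate_paths: {flow_id: [path, ...]}, each path a tuple of link ids
    capacity:        {link_id: capacity}
    """
    load = {l: 0.0 for l in capacity}
    routing = {}
    for f in sorted(flows, key=flows.get, reverse=True):
        def bottleneck(path):
            return max((load[l] + flows[f]) / capacity[l] for l in path)
        best = min(candidate_paths[f], key=bottleneck)
        for l in best:
            load[l] += flows[f]
        routing[f] = best
    return routing, max(load[l] / capacity[l] for l in capacity)

routing, mlu = greedy_route(
    {"f1": 6.0, "f2": 5.0},
    {"f1": [("e1",), ("e2",)], "f2": [("e1",), ("e2",)]},
    {"e1": 10.0, "e2": 10.0})
print(routing, mlu)   # the two flows land on different links; MLU = 0.6
```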
Citations: 10
Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337909
Marco D'Amico, Ana Jokanovic, J. Corbalán
In job scheduling, the concept of malleability has been explored for many years. Research shows that malleability improves system performance, but its use in HPC never became widespread. The causes are the difficulty of developing malleable applications and the lack of support and integration across the different layers of the HPC software stack. In recent years, however, malleability in job scheduling has become more critical because of the increasing complexity of hardware and workloads. In this context, using nodes in exclusive mode, as in traditional HPC jobs where applications were highly tuned for static allocations, is not always the most efficient solution, since it offers zero flexibility for dynamic execution. This paper proposes a new holistic, dynamic job scheduling policy, Slowdown Driven (SD-Policy), which exploits the malleability of applications as the key technology to reduce the average slowdown and response time of jobs. SD-Policy is based on backfill and node sharing. It applies malleability to running jobs to make room for jobs that will run with a reduced set of resources, but only when the estimated slowdown improves over the static approach. We implemented SD-Policy in SLURM and evaluated it in a real production environment, as well as with a simulator using workloads of up to 198K jobs. Results show better resource utilization, with reductions in makespan, response time, slowdown, and energy consumption of up to 7%, 50%, 70%, and 6%, respectively, for the evaluated workloads.
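The slowdown test at the heart of SD-Policy, shrinking a running malleable job only when the estimated slowdown improves, can be illustrated with a toy linear-speedup model. Everything below (linear scaling, the averaging of the two jobs' slowdowns, all parameter names) is an assumption for illustration, not the paper's estimator.

```python
def should_shrink(e, w_r, n, w_w, k, q):
    """Decide whether a running malleable job should release k of its n nodes
    so a queued job can be backfilled now (toy model with linear speedup).

    e:   seconds the running job has already executed
    w_r: its remaining work, in node-seconds
    w_w: total work of the queued job, in node-seconds (it wants k nodes)
    q:   seconds the queued job has waited so far
    Slowdown here is (wait + actual runtime) / ideal runtime.
    """
    # Scenario A: the queued job waits until the running job finishes.
    sd_wait_w = (q + w_r / n + w_w / k) / (w_w / k)
    sd_wait_r = 1.0                       # running job keeps all n nodes
    # Scenario B: the running job shrinks to n-k nodes; queued job starts now.
    sd_shrink_w = (q + w_w / k) / (w_w / k)
    sd_shrink_r = (e + w_r / (n - k)) / (e + w_r / n)
    return (sd_shrink_w + sd_shrink_r) / 2 < (sd_wait_w + sd_wait_r) / 2

# Running job: 1000 s elapsed, 4000 node-s left on 8 nodes. Queued job:
# 1200 node-s on 4 nodes, queued for 600 s. Shrinking wins here.
print(should_shrink(e=1000, w_r=4000, n=8, w_w=1200, k=4, q=600))  # True
```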
Citations: 9
Optimized Execution of Parallel Loops via User-Defined Scheduling Policies
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337913
Seonmyeong Bak, Yanfei Guo, P. Balaji, Vivek Sarkar
On-node parallelism continues to increase in importance for high-performance computing and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and locality in parallel computations, and current programming systems notably lack features to enable efficient use of these large numbers of cores or require users to modify codes significantly. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach to extend the specification of parallel loops via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions when determining how to create chunks and schedule them on worker threads. Our runtime system starts with subspaces specified in the user functions, performs load balancing of chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We demonstrate the benefits of this work using MiniMD, a miniapp derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules and 1.07× over the static_steal schedule from recent research.
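The paper's interface lives inside an OpenMP-style runtime; the Python sketch below only mimics the shape of the idea: a user callback decides how iterations are chunked, while the runtime owns the worker threads and the shared work queue. The callback name, its signature, and the guided-style example policy are all illustrative assumptions.

```python
import queue
import threading

def run_loop(n_iters, body, make_chunks, n_workers=4):
    """Run body(i) for i in range(n_iters); chunk boundaries come from the
    user-defined make_chunks(n_iters, n_workers) callback, and idle workers
    dynamically pull the next chunk from a shared queue."""
    work = queue.Queue()
    for chunk in make_chunks(n_iters, n_workers):
        work.put(chunk)

    def worker():
        while True:
            try:
                lo, hi = work.get_nowait()
            except queue.Empty:
                return
            for i in range(lo, hi):
                body(i)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def guided_chunks(n, p, min_chunk=1):
    """One possible user policy: geometrically shrinking (guided-style) chunks."""
    lo, rem = 0, n
    while rem > 0:
        size = max(min_chunk, rem // (2 * p))
        yield lo, lo + size
        lo, rem = lo + size, rem - size

done = []
run_loop(10, done.append, guided_chunks, n_workers=2)
print(sorted(done))   # every iteration 0..9 executed exactly once
```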
Citations: 5
Massively Parallel Automated Software Tuning
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337908
J. Kurzak, Y. Tsai, M. Gates, A. Abdelfattah, J. Dongarra
This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpenN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources and takes advantage of multithreading for parallel compilation and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined.
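The sketch below shows the general shape of such a sweep in Python: dynamic dispatch of kernel configurations to a pool of workers. The benchmark function is a stand-in; the real system compiles and launches GPU kernels, while this toy uses an analytic cost so the sketch runs anywhere.

```python
from itertools import product
from multiprocessing import Pool

def benchmark(cfg):
    """Stand-in for compiling and timing one kernel variant; returns the
    configuration together with a synthetic cost (lower is better)."""
    tile_m, tile_n, unroll = cfg
    cost = abs(tile_m * tile_n - 1024) + 4.0 / unroll
    return cfg, cost

if __name__ == "__main__":
    sweep = list(product([16, 32, 64], [16, 32, 64], [1, 2, 4]))
    with Pool(processes=4) as pool:
        # imap_unordered is the dynamic-scheduling part: whichever worker
        # finishes first pulls the next configuration from the sweep.
        results = list(pool.imap_unordered(benchmark, sweep))
    best_cfg, best_cost = min(results, key=lambda r: r[1])
    print("best configuration:", best_cfg, "cost:", best_cost)
```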
Citations: 5
Stage Delay Scheduling: Speeding up DAG-style Data Analytics Jobs with Resource Interleaving
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337872
Wujie Shao, Fei Xu, Li Chen, Haoyue Zheng, Fangming Liu
To increase the resource utilization of datacenters, big data analytics jobs commonly run stages in parallel, organized and scheduled according to a directed acyclic graph (DAG). Through an in-depth analysis of the latest Alibaba cluster trace and our motivation experiments on Amazon EC2, however, we show that the CPU and network resources are still under-utilized due to unwise stage scheduling, thereby prolonging the completion time of a DAG-style job (e.g., Spark). While existing works on reducing the job completion time focus on either task scheduling or job scheduling, stage scheduling has received comparatively little attention. In this paper, we design and implement DelayStage, a simple yet effective stage delay scheduling strategy that interleaves the cluster resources across parallel stages, so as to increase cluster resource utilization and speed up job performance. With the aim of minimizing the makespan of parallel stages, DelayStage judiciously arranges the execution of stages in a pipelined manner to maximize the performance benefits of resource interleaving. Extensive prototype experiments on 30 Amazon EC2 instances and complementary trace-driven simulations show that DelayStage can improve cluster resource utilization by up to 81.8% and reduce the job completion time by up to 41.3%, in comparison to stock Spark and state-of-the-art stage scheduling strategies, yet with acceptable runtime overhead.
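The intuition behind delaying stages, serializing the network phases of parallel stages so each transfer overlaps another stage's computation rather than contending for bandwidth, can be seen in a toy makespan model. The model below (a single shared network pipe, unlimited compute slots, perfectly fair sharing in the baseline) is a deliberate simplification, not the paper's formulation.

```python
def pipelined_makespan(stages):
    """Stages fetch one after another over the shared link; each stage's
    compute overlaps the following stages' fetches.

    stages: list of (fetch_seconds, compute_seconds), in start order.
    """
    fetch_done, makespan = 0.0, 0.0
    for fetch, compute in stages:
        fetch_done += fetch                  # fetches go back-to-back
        makespan = max(makespan, fetch_done + compute)
    return makespan

def contended_makespan(stages):
    """Baseline: all stages start at once and fair-share the link, so with
    equal-size fetches every transfer completes only when the total volume
    has drained."""
    total_fetch = sum(f for f, _ in stages)
    return max(total_fetch + c for _, c in stages)

stages = [(4, 20), (4, 5), (4, 5)]   # starting the compute-heavy stage first
print(pipelined_makespan(stages))    # 24.0
print(contended_makespan(stages))    # 32
```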
Citations: 12
TLB
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337866
Jinbin Hu, Jiawei Huang, Wenjun Lv, Weihe Li, Jianxin Wang, Tian He
Modern datacenter topologies are typically multi-rooted trees with multiple paths between any given pair of hosts. Recent load balancing designs focus on making full use of the available parallel paths to provide high bisection bandwidth. However, they are agnostic to the mixed traffic generated by diverse applications in data centers and use the same rerouting granularity for all flows regardless of flow type. As a result, short flows suffer from long-tailed queueing delay and packet reordering, while the throughput of long flows is also degraded dramatically due to low link utilization and packet reordering under the non-adaptive granularity. To solve these problems, we design a traffic-aware load balancing (TLB) scheme that adopts different rerouting granularities for the two kinds of flows. Specifically, TLB adaptively adjusts the switching granularity of long flows according to the load strength of short ones. Under heavy short-flow load, long flows use a large switching granularity, giving short flows more opportunities to choose short queues and complete quickly. When the short-flow load is low, long flows switch paths more flexibly with a small switching granularity to achieve high throughput. TLB is deployed at the switch, without any modifications on the end-hosts. The experimental results of NS2 simulations and a Mininet implementation show that TLB significantly reduces the average flow completion time (AFCT) of short flows by ~15%-40% over the state-of-the-art load balancing schemes and achieves high throughput for long flows.
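The adaptive knob TLB turns can be pictured as a mapping from measured short-flow load to the path-switching granularity (e.g., a flowlet timeout) applied to long flows. The linear mapping and the timeout constants below are illustrative assumptions, not values from the paper.

```python
def long_flow_timeout_us(short_load, fine_us=50.0, coarse_us=500.0):
    """Choose the rerouting granularity for long flows from the short-flow
    load: light short-flow load -> small timeout (flexible path switching,
    higher long-flow throughput); heavy short-flow load -> large timeout
    (long flows stay put, leaving short queues for the short flows).

    short_load: fraction of recent link traffic from short flows, in [0, 1].
    """
    short_load = min(max(short_load, 0.0), 1.0)
    return fine_us + short_load * (coarse_us - fine_us)

for load in (0.1, 0.5, 0.9):
    print(f"short-flow load {load:.0%} -> switch every "
          f"{long_flow_timeout_us(load):.0f} us")
```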
{"title":"TLB","authors":"Jinbin Hu, Jiawei Huang, Wenjun Lv, Weihe Li, Jianxin Wang, Tian He","doi":"10.1145/3337821.3337866","DOIUrl":"https://doi.org/10.1145/3337821.3337866","url":null,"abstract":"Modern datacenter topologies typically are multi-rooted trees consisting of multiple paths between any given pair of hosts. Recent load balancing designs focus on making full use of available parallel paths to provide high bisection bandwidth. However, they are agnostic to the mixed traffic generated by diverse applications in data centers and respectively use the same granularity in rerouting flows regardless of the flow type. Therefore, the short flows suffer the long-tailed queueing delay and reordering problems, while the throughputs of long flows are also degraded dramatically due to low link utilization and packet reordering under the non-adaptive granularity. To solve these problems, we design a traffic-aware load balancing (TLB) scheme to adopt different rerouting granularities for two kinds of flows. Specifically, TLB adaptively adjusts the switching granularity of long flows according to the load strength of short ones. Under the heavy load of short flows, the long flows use large switching granularity to help short ones obtain more opportunities in choosing short queues to complete quickly. When the load strength of short flows is low, the long flows switch paths more flexibly with small switching granularity to achieve high throughput. TLB is deployed at the switch, without any modifications on the end-hosts. The experimental results of NS2 simulations and Mininet implementation show that TLB significantly reduces the average flow completion time (AFCT) of short flows by ~15%-40% over the state-of-the-art load balancing schemes and achieves the high throughput for long flows.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124353888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Network Congestion-aware Online Service Function Chain Placement and Load Balancing
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337850
Xiaojun Shang, Zhenhua Liu, Yuanyuan Yang
Emerging virtual network functions (VNFs) introduce new flexibility and scalability into traditional middleboxes. Specifically, middleboxes are virtualized as software-based platforms running on commodity servers known as network points of presence (N-PoPs). Traditional network services are therefore realized by chained VNFs, i.e., service function chains (SFCs), running on potentially multiple N-PoPs. SFCs can be flexibly placed and routed to reduce operating cost. However, excessively pursuing low cost may incur congestion on some popular N-PoPs and links, which results in performance degradation or even violations of service level agreements. In this paper, we first propose an optimization problem for joint SFC placement and routing. Given that the problem is NP-hard, we design an approximation algorithm named candidate path selection (CPS) with a theoretical performance guarantee. We then propose an online optimization problem for the placement of SFCs under fast demand fluctuation. The problem accounts for the migration costs of VNFs between time slots, and we design an online candidate path selection (OCPS) algorithm to handle it. Extensive simulation results highlight that the CPS and OCPS algorithms provide efficient placement and routing of SFCs comparable to the optimal solution.
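As a sketch of the candidate-path idea, the function below chooses, for one SFC, the candidate placement whose worst N-PoP or link utilization after placement is smallest, then commits the load. The data model and names are illustrative; the paper's CPS additionally carries an approximation guarantee that this naive greedy does not.

```python
def place_chain(vnf_cpu, bw, candidates, node_load, link_load,
                node_cap, link_cap):
    """Pick the candidate placement for one SFC that minimizes the maximum
    utilization it causes, then commit its load.

    vnf_cpu:    [cpu demand of each VNF in the chain, in order]
    bw:         bandwidth the chain needs on every link of its path
    candidates: [(nodes, links)]; nodes[i] hosts VNF i, links connect them
    """
    def worst_util(nodes, links):
        u_node = max((node_load[n] + c) / node_cap[n]
                     for n, c in zip(nodes, vnf_cpu))
        u_link = max(((link_load[l] + bw) / link_cap[l] for l in links),
                     default=0.0)
        return max(u_node, u_link)

    nodes, links = min(candidates, key=lambda c: worst_util(*c))
    for n, c in zip(nodes, vnf_cpu):
        node_load[n] += c
    for l in links:
        link_load[l] += bw
    return nodes, links

node_load, link_load = {"A": 2.0, "B": 0.0}, {"AB": 1.0}
node_cap, link_cap = {"A": 8.0, "B": 8.0}, {"AB": 10.0}
cands = [(("A", "B"), ("AB",)), (("B", "A"), ("AB",))]
print(place_chain([4.0, 2.0], 3.0, cands,
                  node_load, link_load, node_cap, link_cap))
# (('B', 'A'), ('AB',)): hosting the heavy VNF on the idle N-PoP congests less
```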
Citations: 12
COMBFT
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337885
Yingyao Rong, Weigang Wu, Zhiguang Chen
Byzantine Fault-Tolerant (BFT) state machine replication is an important building block for highly available distributed computing. This paper presents COMBFT, a BFT protocol that achieves both efficiency and robustness simultaneously. The major novelty of COMBFT lies in Conflicting-Order-Match (COM), a new request ordering mechanism that selects available sequence numbers for requests in a new way and detects a potentially malicious primary early. COM assigns sequence numbers based on request interference and requires both the primary and the backup nodes to conduct request ordering, which greatly reduces the impact of a malicious primary and malicious clients. When a backup suspects the primary may be malicious, it triggers an efficient two-phase commit protocol (a suspect phase and a commit phase) to further confirm whether the primary is malicious and to commit the request. The performance of COMBFT is evaluated via simulations, and the results illustrate its strong performance in terms of throughput, latency, and fault scalability.
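COM's starting point, that only interfering requests need a total order, can be sketched as sequence assignment over a conflict relation. The data model below (a conflicts predicate over requests, one shared log) is an illustrative toy, not COMBFT's replica protocol.

```python
def assign_seq(log, new_req, conflicts):
    """Give new_req a sequence number one past its highest-numbered
    conflicting predecessor; commuting requests may share a number and
    can therefore commit in parallel.

    log: list of (seq, req) pairs already assigned.
    """
    deps = [seq for seq, req in log if conflicts(req, new_req)]
    seq = max(deps) + 1 if deps else 1
    log.append((seq, new_req))
    return seq

conflicts = lambda a, b: a["key"] == b["key"]   # same key => must be ordered
log = []
print(assign_seq(log, {"key": "x", "op": "set"}, conflicts))  # 1
print(assign_seq(log, {"key": "y", "op": "set"}, conflicts))  # 1 (commutes)
print(assign_seq(log, {"key": "x", "op": "inc"}, conflicts))  # 2 (conflicts)
```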
{"title":"COMBFT","authors":"Yingyao Rong, Weigang Wu, Zhiguang Chen","doi":"10.1145/3337821.3337885","DOIUrl":"https://doi.org/10.1145/3337821.3337885","url":null,"abstract":"Byzantine Fault-Tolerant (BFT) state machine replication protocol is an important building block for highly available distributed computing. This paper presents COMBFT, a BFT protocol that achieves both efficiency and robustness simultaneously. The major novelty of COMBFT lies in Conflicting-Order-Match (COM), a new request ordering mechanism that uses a new way to select the available sequence number for requests, and detects the possible malicious primary early. COM assigns sequence number based on request interference, and requires both primary and backup nodes to conduct request ordering, which can greatly reduce the impact of malicious primary and clients. When the backup suspects the primary may be malicious, it triggers an efficient commit protocol with two phases (i.e., suspect phase and commit phase) to further confirm whether the primary is malicious, and commit the request. The performance of COMBFT is evaluated via simulations and the results illustrate the outstanding performance of COMBFT in terms of throughput, latency and fault scalability.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123197193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Artemis
Pub Date: 2019-08-05 DOI: 10.1145/3337821.3337897
Xuebing Li, Bingyang Liu, Yang Chen, Yu Xiao, Jiaxin Tang, Xin Wang
Today, Internet services are typically deployed with server replicas at multiple locations for load balancing, failure tolerance, and user experience optimization. The domain name system (DNS) is responsible for translating human-readable domain names into network-routable IP addresses. When multiple replicas exist, upon the arrival of a query, DNS selects one replica and responds with its IP address. Thus, the delay caused by the DNS query, including the selection of a replica, is part of the connection setup latency. In this paper, we propose Artemis, a practical low-latency naming and routing system that aims to reduce the connection setup latency by eliminating the DNS query latency while retaining the ability to perform optimal server (replica) selection based on user-defined rules. Artemis achieves these goals by integrating name resolution into the transport layer handshake. Artemis allows clients to compute locally the IP address of a Service Dispatcher, which serves as a proxy for the hosting servers. A Service Dispatcher forwards the handshake request from a client to a server, and the response carries the server's IP address back to the client. This enables clients to connect directly to the server afterward without querying DNS servers, and therefore eliminates the DNS query latency. Meanwhile, Artemis supports user-defined replica selection policies. We have implemented Artemis and evaluated its performance using the PlanetLab testbed and RIPE Atlas probes. Our results show that Artemis reduces the connection setup latency by 26.2% on average compared with the state of the art.
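The step that removes the resolver round trip is the client's local computation of a Service Dispatcher address. One way to picture it: hash the service name into a well-known dispatcher prefix, as in the sketch below. The hashing scheme and the prefix are assumptions for illustration; the paper only requires that clients can derive a dispatcher address without querying DNS.

```python
import hashlib
import socket
import struct

DISPATCHER_PREFIX = "198.51.100.0/24"   # illustrative documentation-range prefix

def dispatcher_addr(service_name, prefix=DISPATCHER_PREFIX):
    """Deterministically map a service name to an address inside the
    dispatcher prefix, so every client computes the same address locally."""
    net, bits = prefix.split("/")
    base = struct.unpack("!I", socket.inet_aton(net))[0]
    host_bits = 32 - int(bits)
    h = int.from_bytes(hashlib.sha256(service_name.encode()).digest()[:4], "big")
    return socket.inet_ntoa(struct.pack("!I", base | (h % (1 << host_bits))))

print(dispatcher_addr("example-service"))   # same on every client, no DNS query
```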
{"title":"Artemis","authors":"Xuebing Li, Bingyang Liu, Yang Chen, Yu Xiao, Jiaxin Tang, Xin Wang","doi":"10.1145/3337821.3337897","DOIUrl":"https://doi.org/10.1145/3337821.3337897","url":null,"abstract":"Today, Internet service deployment is typically implemented with server replication at multiple locations for the purpose of load balancing, failure tolerance, and user experience optimization. Domain name system (DNS) is responsible for translating human-readable domain names into network-routable IP addresses. When multiple replicas exist, upon the arrival of a query, DNS selects one replica and responds with its IP address. Thus, the delay caused by the process of DNS query including the selection of replica is part of the connection setup latency. In this paper, we proposed Artemis, a practical low-latency naming and routing system that aims at reducing the connection setup latency by eliminating the DNS query latency while keeping the ability to perform optimal server (replica) selection based on user-defined rules. Artemis achieves these goals by integrating name resolution into the transport layer handshake. Artemis allows clients to calculate locally the IP address of a Service Dispatcher, which serves as a proxy of hosting servers. Service Dispatchers forward the handshake request from a client to a server, and the response is embedded with the server's IP address back to the client. This enables clients to connect directly with servers afterward without querying DNS servers, and therefore eliminates the DNS query latency. Meanwhile, Artemis supports user-defined replica selection policies. We have implemented Artemis and evaluated its performance using the PlanetLab testbed and RIPE Atlas probes. Our results show that Artemis reduces the connection setup latency by 26.2% on average compared with the state-of-the-art.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"856 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114138079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4