2014 IEEE 28th International Parallel and Distributed Processing Symposium: Latest Papers, Page 10

Shedding Light on Lithium/Air Batteries Using Millions of Threads on the BG/Q Supercomputer

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.81

V. Weber, C. Bekas, T. Laino, A. Curioni, A. Bertsch, S. Futral

In this work, we present a novel parallelization scheme for highly efficient evaluation of the Hartree-Fock exact exchange (HFX) in ab initio molecular dynamics simulations, specifically tailored for condensed-phase simulations. Our developments make it possible to reach the accuracy required for the evaluation of the HFX in a highly controllable manner. We show that our solutions can take great advantage of the latest trends in HPC platforms, such as extreme threading, short vector instructions, and high-dimensional interconnection networks, all of which are evident in the IBM Blue Gene/Q supercomputer. We demonstrate unprecedented scalability up to 6,291,456 threads (96 BG/Q racks) with near-perfect parallel efficiency, which represents a more than 20-fold improvement over the current state of the art. In terms of time to solution, we achieved an improvement that can surpass a 10-fold decrease in runtime with respect to directly comparable approaches. We exploit this development to enhance the accuracy of DFT-based molecular dynamics by using the PBE0 hybrid functional. This approach allowed us to investigate the chemical behavior of organic solvents in one of the most challenging research topics in energy storage, lithium/air batteries, and to propose alternative solvents with enhanced stability to ensure an appropriate reversible electrochemical reaction. This step is key for the development of a viable lithium/air storage technology, and it would have been a daunting computational task using standard methods. Recent research has shown that the electrolyte plays a key role in non-aqueous lithium/air batteries in producing the appropriate reversible electrochemical reduction. In particular, the chemical degradation of propylene carbonate, the typical electrolyte used, by lithium peroxide has been demonstrated by molecular dynamics simulations of highly realistic models. Reaching the necessary high accuracy in these simulations is a daunting computational task using standard methods.
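The thread count quoted in the abstract can be decomposed with a short calculation. This is only a sanity-check sketch: the abstract gives the 96 racks and the total, while the per-rack, per-node, and per-core figures below are assumptions taken from the published Blue Gene/Q design rather than from the abstract.

```python
# Decompose the thread count quoted in the abstract.
# Only RACKS comes from the abstract; the other three constants are
# assumptions based on the published Blue Gene/Q hardware design.
RACKS = 96                # from the abstract
NODES_PER_RACK = 1024     # assumption: compute nodes per BG/Q rack
CORES_PER_NODE = 16       # assumption: compute cores per node
THREADS_PER_CORE = 4      # assumption: hardware threads per core


def total_threads(racks=RACKS):
    """Return the number of hardware threads available on `racks` racks."""
    return racks * NODES_PER_RACK * CORES_PER_NODE * THREADS_PER_CORE


print(total_threads())  # 6291456, matching the figure in the abstract
```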



Citations: 12

Characterization and Optimization of Memory-Resident MapReduce on HPC Systems

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.87

Yandong Wang, R. Goldstone, Weikuan Yu, Teng Wang

MapReduce is a widely accepted framework for addressing big data challenges. Recently, it has also gained broad attention from scientists at the U.S. leadership computing facilities as a promising solution for processing gigantic simulation results. However, conventional high-end computing systems are built on a compute-centric paradigm, while big data analytics applications prefer a data-centric paradigm such as MapReduce. This work characterizes the performance impact of key differences between the compute- and data-centric paradigms and then provides optimizations to enable a dual-purpose HPC system that can efficiently support both conventional HPC applications and new data analytics applications. Using Spark, a state-of-the-art MapReduce implementation, and the Hyperion system at Lawrence Livermore National Laboratory, we have examined the impact of storage architectures, data locality, and task scheduling on memory-resident MapReduce jobs. Based on our characterization of the performance behaviors, we have introduced two optimization techniques, namely Enhanced Load Balancer and Congestion-Aware Task Dispatching, to improve the performance of Spark applications.



Citations: 61

Anatomy of High-Performance Many-Threaded Matrix Multiplication

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.110

T. Smith, R. Geijn, M. Smelyanskiy, J. Hammond, F. V. Zee

BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this exposes a finer level of parallelism that greatly simplifies the multithreading of GEMM and creates additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient but also necessary for scalability. The resulting implementations deliver what we believe to be the best open-source performance for these architectures, achieving both impressive performance and excellent scalability.
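The loop structure described in this abstract (three loops around an inner kernel, which itself consists of two further loops around a micro-kernel) can be sketched as follows. This is a minimal pure-Python illustration, not BLIS code: the function names and the block sizes NC, KC, MC, NR, and MR are placeholders chosen for the example, the packing of A and B into contiguous buffers is omitted, and a real micro-kernel would be written in architecture-specific vector code.

```python
def micro_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Update the small block C[i0:i1][j0:j1] += A[i0:i1][p0:p1] * B[p0:p1][j0:j1]."""
    for p in range(p0, p1):
        for i in range(i0, i1):
            a = A[i][p]
            for j in range(j0, j1):
                C[i][j] += a * B[p][j]


def gemm_blocked(A, B, C, NC=8, KC=4, MC=4, NR=2, MR=2):
    """Compute C += A * B for list-of-lists matrices using a five-loop nest.

    Block sizes are illustrative placeholders, not tuned values.
    """
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, NC):              # outer loop over column panels of B and C
        j_end = min(jc + NC, n)
        for pc in range(0, k, KC):          # outer loop over the shared dimension
            p_end = min(pc + KC, k)
            for ic in range(0, m, MC):      # outer loop over row blocks of A and C
                i_end = min(ic + MC, m)
                # The two loops below are the ones exposed inside the inner kernel.
                for jr in range(jc, j_end, NR):
                    for ir in range(ic, i_end, MR):
                        micro_kernel(A, B, C,
                                     ir, min(ir + MR, i_end),
                                     jr, min(jr + NR, j_end),
                                     pc, p_end)
    return C
```

Each of the five loops iterates over independent blocks of C except the `pc` loop, which accumulates into the same block; that distinction is what makes several of the loops candidates for parallelization.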



Citations: 122

Scalable Single Source Shortest Path Algorithms for Massively Parallel Systems

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.96

Venkatesan T. Chakaravarthy, Fabio Checconi, F. Petrini, Yogish Sabharwal

In the single-source shortest path (SSSP) problem, we have to find the shortest paths from a source vertex v to all other vertices in a graph. In this paper, we introduce a novel parallel algorithm derived from the Bellman-Ford and Delta-stepping algorithms. We employ various pruning techniques, such as edge classification and direction optimization, to dramatically reduce inter-node communication traffic, and we propose load-balancing strategies to handle higher-degree vertices. The extensive performance analysis shows that our algorithms work well on scale-free and real-world graphs. In the largest tested configuration, an R-MAT graph with 2^38 vertices and 2^42 edges on 32,768 Blue Gene/Q nodes, we have achieved a processing rate of three trillion edges per second (TTEPS), a four-orders-of-magnitude improvement over the best published results.
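For readers unfamiliar with Delta-stepping, one of the two algorithms the paper builds on, the following is a minimal sequential sketch of the textbook method. It is not the authors' parallel algorithm and includes none of their pruning, direction-optimization, or load-balancing techniques. The graph representation (a dict mapping every vertex to a list of `(neighbor, weight)` pairs with non-negative weights) and the function name are assumptions made for the example.

```python
import math


def delta_stepping(graph, source, delta):
    """Sequential Delta-stepping SSSP.

    graph: dict mapping every vertex to a list of (neighbor, weight) pairs,
           with non-negative weights; every neighbor must also be a key.
    delta: bucket width (> 0). Edges with weight <= delta are "light".
    """
    dist = {v: math.inf for v in graph}
    buckets = {}  # bucket index -> set of vertices with tentative distance in that range

    def relax(v, d):
        if d < dist[v]:
            if dist[v] != math.inf:
                buckets.get(int(dist[v] // delta), set()).discard(v)
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0)
    while buckets:
        i = min(buckets)
        settled = set()
        # Light-edge phase: vertices may re-enter bucket i, so repeat until it is empty.
        while buckets.get(i):
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in graph[u]:
                    if w <= delta:
                        relax(v, dist[u] + w)
        buckets.pop(i, None)
        # Heavy-edge phase: each settled vertex relaxes its heavy edges once.
        for u in settled:
            for v, w in graph[u]:
                if w > delta:
                    relax(v, dist[u] + w)
    return dist
```

With a very large `delta` every edge is light and the method behaves like a Bellman-Ford-style label-correcting algorithm; with a very small `delta` it approaches Dijkstra's ordering. That trade-off is why the two algorithms are natural starting points for a hybrid.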


Citations: 53

An Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.116

K. Ibrahim, Paul H. Hargrove, Costin Iancu, K. Yelick

The Cray Gemini interconnect hardware provides multiple transfer mechanisms and out-of-order message delivery to improve communication throughput. In this paper we quantify the performance of one-sided and two-sided communication paradigms with respect to: 1) the optimal available hardware transfer mechanism, 2) message ordering constraints, and 3) per-node and per-core message concurrency. In addition to using Cray native communication APIs, we use UPC and MPI micro-benchmarks to capture one- and two-sided semantics, respectively. Our results indicate that relaxing the message delivery order can improve performance by up to 4.6x compared with strict ordering. When hardware allows it, high-level one-sided programming models can already take advantage of message reordering. Enforcing the ordering semantics of two-sided communication comes with a performance penalty. Furthermore, we argue that exposing out-of-order delivery at the application level is required for next-generation programming models. Any ordering constraints in the language specifications reduce communication performance for small messages and increase the number of active cores required for peak throughput.



Citations: 17

Fair Maximal Independent Sets

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.79

Jeremy T. Fineman, Calvin C. Newport, M. Sherr, Tonghe Wang

Finding a maximal independent set (MIS) is a classic problem in graph theory that has been widely studied in the context of distributed algorithms. Standard distributed solutions to the MIS problem focus on time complexity. In this paper, we also consider fairness. For a given MIS algorithm A and graph G, we define the inequality factor for A on G to be the largest ratio between the probabilities of the nodes joining an MIS in the graph. We say an algorithm is fair with respect to a family of graphs if it achieves a constant inequality factor for all graphs in the family. In this paper, we seek efficient and fair algorithms for common graph families. We begin by describing an algorithm that is fair and runs in O(log* n) time in rooted trees of size n. Moving to unrooted trees, we describe a fair algorithm that runs in O(log n) time. Generalizing further to bipartite graphs, we describe a third fair algorithm that requires O(log^2 n) rounds. We also show a fair algorithm for planar graphs that runs in O(log^2 n) rounds, and describe an algorithm that can be run in any graph, yielding good bounds on inequality in regions that can be efficiently colored with a small number of colors. We conclude our theoretical analysis with a lower bound that identifies a graph where all MIS algorithms achieve an inequality bound in Ω(n), eliminating the possibility of an MIS algorithm that is fair in all graphs. Finally, to motivate the need for provable fairness guarantees, we simulate both our tree algorithm and Luby's MIS algorithm [13] in a variety of different tree topologies, some synthetic and some derived from real-world data. Whereas our algorithm always yields an inequality factor ≤3.25 in these simulations, Luby's algorithm yields factors as large as 168.
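The inequality factor defined in this abstract can be estimated empirically. The sketch below simulates a common random-priority variant of Luby's algorithm (an assumption; the abstract does not say which variant the authors simulated) and reports the ratio between the largest and smallest observed join frequencies. The function names, trial count, and seed are illustrative choices, not taken from the paper.

```python
import random


def luby_mis(adj, rng):
    """One run of a random-priority Luby-style MIS.

    adj maps each node to the set of its neighbours (undirected graph).
    """
    active = set(adj)
    mis = set()
    while active:
        prio = {v: rng.random() for v in active}
        # A node joins if its priority beats every still-active neighbour.
        winners = {v for v in active
                   if all(prio[v] < prio[u] for u in adj[v] if u in active)}
        mis |= winners
        removed = set(winners)
        for v in winners:
            removed |= adj[v]
        active -= removed
    return mis


def inequality_factor(adj, trials=4000, seed=0):
    """Estimate max/min over nodes of the probability of joining the MIS."""
    rng = random.Random(seed)
    joins = {v: 0 for v in adj}
    for _ in range(trials):
        for v in luby_mis(adj, rng):
            joins[v] += 1
    lo, hi = min(joins.values()), max(joins.values())
    return float("inf") if lo == 0 else hi / lo


# Star graph: node 0 is the centre, nodes 1..3 are leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(inequality_factor(star))
```

On a star with k leaves this variant lets the centre join only when it draws the smallest priority of all k + 1 nodes, so the estimate should land near k. This illustrates how unfair a standard algorithm can be even on a very simple tree.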



Citations: 3

A Coprocessor Sharing-Aware Scheduler for Xeon Phi-Based Compute Clusters

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.44

G. Coviello, S. Cadambi, S. Chakradhar

We propose a cluster scheduling technique for compute clusters with Xeon Phi coprocessors. Even though the Xeon Phi runs Linux, which allows multiprocessing, cluster schedulers generally do not allow jobs to share coprocessors because sharing can cause oversubscription of coprocessor memory and thread resources. It has been shown that memory or thread oversubscription on a many-core processor such as the Phi results in job crashes or drastic performance loss. We first show that such an exclusive device allocation policy causes severe coprocessor underutilization: for typical workloads, on average only 38% of the Xeon Phi cores are busy across the cluster. Then, to improve coprocessor utilization, we propose a scheduling technique that enables safe coprocessor sharing without resource oversubscription. Jobs specify their maximum memory and thread requirements, and our scheduler packs as many jobs as possible on each coprocessor in the cluster, subject to resource limits. We solve this problem using a greedy approach at the cluster level combined with a knapsack-based algorithm for each node. Every coprocessor is modeled as a knapsack, and jobs are packed into each knapsack with the goal of maximizing job concurrency, i.e., having as many jobs as possible executing on each coprocessor. Given a set of jobs, we show that this strategy of packing for high concurrency is a good proxy for (i) reducing makespan, without the need for users to specify job execution times, and (ii) reducing coprocessor footprint, i.e., the number of coprocessors required to finish the jobs without increasing makespan. We implement the entire system as a seamless add-on to Condor, a popular distributed job scheduler, and show makespan and footprint reductions of more than 50% across a wide range of workloads.
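The packing idea in this abstract can be illustrated with a small sketch. It models each coprocessor as a two-dimensional 0/1 knapsack (memory and threads) whose objective is the number of jobs packed, and fills coprocessors one at a time in a greedy pass. This is an illustration of the general technique under stated assumptions, not the authors' scheduler: the function names, the job tuple layout, and the 8 GB / 240-thread capacities in the usage example are hypothetical.

```python
def pack_coprocessor(jobs, mem_cap, thread_cap):
    """Choose a maximum-size subset of jobs that fits on one coprocessor.

    jobs: list of (name, mem, threads) tuples with integer requirements.
    Returns a tuple of job names. Uses a 0/1 knapsack over reachable
    (memory used, threads used) states, where every job has value 1.
    """
    states = {(0, 0): ()}  # (mem used, threads used) -> largest job set found
    for name, mem, threads in jobs:
        # Iterate over a snapshot so each job is added at most once.
        for (m, t), chosen in list(states.items()):
            nm, nt = m + mem, t + threads
            if nm <= mem_cap and nt <= thread_cap:
                if len(chosen) + 1 > len(states.get((nm, nt), ())):
                    states[(nm, nt)] = chosen + (name,)
    return max(states.values(), key=len)


def schedule(jobs, coprocessors):
    """Greedy cluster-level pass: fill coprocessors one after another.

    coprocessors: list of (name, mem_cap, thread_cap) tuples.
    Returns (plan, waiting), where plan maps coprocessor name to job names
    and waiting lists the jobs that did not fit anywhere.
    """
    remaining = list(jobs)
    plan = {}
    for cp_name, mem_cap, thread_cap in coprocessors:
        chosen = set(pack_coprocessor(remaining, mem_cap, thread_cap))
        plan[cp_name] = sorted(chosen)
        remaining = [job for job in remaining if job[0] not in chosen]
    return plan, [job[0] for job in remaining]


# Hypothetical jobs: (name, memory in GB, threads).
jobs = [("a", 4, 120), ("b", 3, 60), ("c", 2, 60), ("d", 6, 200)]
plan, waiting = schedule(jobs, [("mic0", 8, 240), ("mic1", 8, 240)])
print(plan, waiting)
```

Because the objective counts jobs rather than summing run times, the sketch needs no execution-time estimates from users, which mirrors the property the abstract highlights.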



Citations: 3

It's About Time: On Optimal Virtual Network Embeddings under Temporal Flexibilities

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.14

Matthias Rost, S. Schmid, A. Feldmann

Distributed applications often require high-performance networks with strict connectivity guarantees. For instance, many cloud applications suffer from today's variations in intra-cloud bandwidth, which lead to poor and unpredictable application performance. Accordingly, we witness a trend towards virtual networks (VNets), which can provide resource isolation. Interestingly, while the problem of where to embed a VNet is fairly well understood today, much less is known about when to optimally allocate a VNet. This, however, is important, as the requirements specified for a VNet do not have to be static but can vary over time and even include certain temporal flexibilities. This paper initiates the study of the temporal VNet embedding problem (TVNEP). We propose a continuous-time mathematical programming approach to solve the TVNEP, and present and compare different algorithms. Based on these insights, we present the CSM-Model, which incorporates both symmetry and state-space reductions to significantly speed up the process of computing exact solutions to the TVNEP. Based on the CSM-Model, we derive a greedy algorithm, OGA, to compute fast approximate solutions. In an extensive computational evaluation, we show that despite the hardness of the TVNEP, the CSM-Model is sufficiently powerful to solve moderately sized instances to optimality within one hour and under different objective functions (such as maximizing the number of embeddable VNets). We also show that the greedy algorithm exploits flexibilities well and yields good solutions. More generally, our results suggest that even small temporal flexibilities can improve the overall system performance significantly.


Citations: 24

Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.58

A. Haidar, Chongxiao Cao, A. YarKhan, P. Luszczek, S. Tomov, K. Kabir, J. Dongarra

Many of the heterogeneous resources available to modern computers are designed for different workloads. To use GPU resources efficiently, the workload must have a greater degree of parallelism than a workload designed for multicore CPUs. Conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we use a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.
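The task superscalar idea mentioned above can be illustrated with a small sketch: tasks are inserted in serial program order together with the data they read and write, and the runtime derives the dependencies and runs independent tasks concurrently. This is a toy illustration, not the runtime used in the paper; the class `TinyRuntime` and its methods are hypothetical, and it targets CPU threads only.

```python
from concurrent.futures import ThreadPoolExecutor


class TinyRuntime:
    """Toy task-superscalar scheduler: serial insertion, dataflow execution."""

    def __init__(self):
        self.tasks = []        # (function, dependency ids) in insertion order
        self.last_writer = {}  # data name -> id of the task that last wrote it
        self.readers = {}      # data name -> ids of tasks reading since that write

    def insert(self, fn, reads=(), writes=()):
        tid = len(self.tasks)
        deps = set()
        for d in reads:   # read-after-write
            if d in self.last_writer:
                deps.add(self.last_writer[d])
        for d in writes:  # write-after-write and write-after-read
            if d in self.last_writer:
                deps.add(self.last_writer[d])
            deps.update(self.readers.get(d, ()))
        for d in reads:
            self.readers.setdefault(d, []).append(tid)
        for d in writes:
            self.last_writer[d] = tid
            self.readers[d] = []
        self.tasks.append((fn, deps))
        return tid

    def levels(self):
        """Wavefront index of each task: 0 if it has no dependencies."""
        level = []
        for _, deps in self.tasks:
            level.append(1 + max((level[d] for d in deps), default=-1))
        return level

    def run(self, workers=4):
        level = self.levels()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for lv in range(max(level, default=-1) + 1):
                wave = [fn for (fn, _), l in zip(self.tasks, level) if l == lv]
                for future in [pool.submit(fn) for fn in wave]:
                    future.result()  # wait for the wave; re-raise task errors


data = {}


def set_a():
    data["a"] = 2


def set_b():
    data["b"] = 3


def multiply():
    data["c"] = data["a"] * data["b"]


def overwrite_a():
    data["a"] = 10


rt = TinyRuntime()
rt.insert(set_a, writes=["a"])
rt.insert(set_b, writes=["b"])
rt.insert(multiply, reads=["a", "b"], writes=["c"])
rt.insert(overwrite_a, writes=["a"])  # must wait for multiply (write-after-read)
rt.run()
```

The caller writes what looks like a serial sequence of calls; `set_a` and `set_b` land in the same wave and may run in parallel, while `multiply` and `overwrite_a` are ordered by their data dependencies. A production runtime would schedule each task as soon as its own dependencies finish instead of waiting for a whole wave.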

Citations: 35

Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.83

Jae-Seung Yeom, A. Bhatele, K. Bisset, Eric J. Bohm, Abhishek K. Gupta, L. Kalé, M. Marathe, Dimitrios S. Nikolopoulos, M. Schulz, Lukasz Wesolowski

Modeling dynamical systems represents an important application class covering a wide range of disciplines including, but not limited to, biology, chemistry, finance, national security, and health care. Such applications typically involve large-scale, irregular graph processing, which makes them difficult to scale due to the evolutionary nature of their workload, irregular communication, and load imbalance. EpiSimdemics is one such application: it simulates epidemic diffusion in extremely large and realistic social contact networks. It implements a graph-based system that captures dynamics among co-evolving entities. This paper presents an implementation of EpiSimdemics in Charm++ that enables future research by social, biological, and computational scientists at unprecedented data and system scales. We present new methods for application-specific processing of graph data and demonstrate the effectiveness of these methods on a Cray XE6, specifically NCSA's Blue Waters system.
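As a rough illustration of epidemic diffusion over a contact network, the sketch below runs a simple susceptible/infectious/recovered (SIR) process on an adjacency list. It is not the EpiSimdemics model or its Charm++ implementation: the single-step infectious period, the uniform transmission probability, and the function names `sir_step` and `simulate` are all assumptions made for the example.

```python
import random


def sir_step(contacts, state, p_transmit, rng):
    """Advance the epidemic by one step; state maps person -> 'S', 'I' or 'R'."""
    new_state = dict(state)
    for person, status in state.items():
        if status != "I":
            continue
        for other in contacts.get(person, ()):
            # Each infectious contact transmits independently with p_transmit.
            if state[other] == "S" and rng.random() < p_transmit:
                new_state[other] = "I"
        new_state[person] = "R"  # assumption: infectious for exactly one step
    return new_state


def simulate(contacts, seeds, p_transmit, steps, seed=0):
    """Return the list of states, starting from the initially infected seeds."""
    rng = random.Random(seed)
    state = {p: ("I" if p in seeds else "S") for p in contacts}
    history = [state]
    for _ in range(steps):
        state = sir_step(contacts, state, p_transmit, rng)
        history.append(state)
    return history


# A chain of four people: 0 - 1 - 2 - 3 (symmetric contact lists).
contacts = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
history = simulate(contacts, seeds={0}, p_transmit=1.0, steps=2)
# With certain transmission the infection advances one hop per step:
# history[-1] == {0: "R", 1: "R", 2: "I", 3: "S"}
```

Even this toy version shows why such workloads are irregular: the set of active (infectious) vertices changes every step, so the work per step and its location in the graph are not known in advance.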

Citations: 43