Adapting Batch Scheduling to Workload Characteristics: What Can We Expect From Online Learning?
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00077
Arnaud Legrand, D. Trystram, Salah Zrigui
Despite the impressive growth in the size of supercomputers, the computational power they provide still cannot match the demand, so efficient and fair resource allocation is a critical task. Supercomputers use Resource and Job Management Systems to schedule applications, generally relying on generic index policies such as First Come First Served and Shortest Processing Time First in combination with backfilling strategies. Unfortunately, such generic policies often fail to exploit specific characteristics of real workloads. In this work, we focus on improving the performance of online schedulers. We study mixed policies, which are created by combining multiple job characteristics in a weighted linear expression, as opposed to classical pure policies, which use only a single characteristic. This larger class of scheduling policies aims at providing more flexibility and adaptability. We use space coverage and black-box optimization techniques to explore this new space of mixed policies, and we study how they can adapt to changes in the workload. We perform an extensive experimental campaign through which we show that (1) even the best pure policy is far from optimal, (2) a carefully tuned mixed policy significantly improves the performance of the system, and (3) there is no one-size-fits-all policy: the rapid evolution of the workload seems to prevent classical online learning algorithms from being effective.
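To make the idea of a mixed policy concrete, the sketch below scores waiting jobs with a weighted linear combination of job characteristics (estimated runtime, requested cores, waiting time). The field names, weights, and dispatch loop are illustrative assumptions, not the tuned policies or simulator from the paper.

```python
# Illustrative sketch of a "mixed" index policy: jobs are prioritized by a
# weighted linear combination of their characteristics instead of a single
# one (as in pure policies such as FCFS or SPF). Weights are assumptions.
from dataclasses import dataclass

@dataclass
class Job:
    submit_time: float        # when the job entered the queue
    estimated_runtime: float  # user-provided walltime estimate
    requested_cores: int

def mixed_priority(job, now, weights=(0.5, 0.3, 0.2)):
    """Lower score = scheduled earlier. Pure SPF corresponds to
    weights=(1, 0, 0) and pure FCFS to weights=(0, 0, 1)."""
    w_runtime, w_cores, w_wait = weights
    waiting_time = now - job.submit_time
    return (w_runtime * job.estimated_runtime
            + w_cores * job.requested_cores
            - w_wait * waiting_time)

def pick_next(queue, now):
    # The scheduler (with backfilling around it) would dispatch the job
    # with the smallest mixed score first.
    return min(queue, key=lambda j: mixed_priority(j, now))

queue = [Job(0.0, 3600, 64), Job(10.0, 300, 8), Job(20.0, 7200, 256)]
print(pick_next(queue, now=30.0))
```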
{"title":"Adapting Batch Scheduling to Workload Characteristics: What Can We Expect From Online Learning?","authors":"Arnaud Legrand, D. Trystram, Salah Zrigui","doi":"10.1109/IPDPS.2019.00077","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00077","url":null,"abstract":"Despite the impressive growth and size of super-computers, the computational power they provide still cannot match the demand. Efficient and fair resource allocation is a critical task. Super-computers use Resource and Job Management Systems to schedule applications, which is generally done by relying on generic index policies such as First Come First Served and Shortest Processing time First in combination with Backfilling strategies. Unfortunately, such generic policies often fail to exploit specific characteristics of real workloads. In this work, we focus on improving the performance of online schedulers. We study mixed policies, which are created by combining multiple job characteristics in a weighted linear expression, as opposed to classical pure policies which use only a single characteristic. This larger class of scheduling policies aims at providing more flexibility and adaptability. We use space coverage and black-box optimization techniques to explore this new space of mixed policies and we study how can they adapt to the changes in the workload. We perform an extensive experimental campaign through which we show that (1) even the best pure policy is far from optimal and that (2) using a carefully tuned mixed policy would allow to significantly improve the performance of the system. (3) We also provide empirical evidence that there is no one size fits all policy, by showing that the rapid workload evolution seems to prevent classical online learning algorithms from being effective.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131104731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modelling DVFS and UFS for Region-Based Energy Aware Tuning of HPC Applications
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00089
Mohak Chadha, M. Gerndt
Energy efficiency and energy conservation are among the most crucial constraints for meeting the 20 MW power envelope desired for exascale systems. To this end, most research in this area has focused on user-controllable hardware switches such as per-core dynamic voltage and frequency scaling (DVFS) and software-controlled clock modulation at the application level. In this paper, we present a tuning plugin for the Periscope Tuning Framework which integrates fine-grained autotuning at the region level with DVFS and uncore frequency scaling (UFS). The tuning is based on a feed-forward neural network formulated using Performance Monitoring Counters (PMCs) available on x86 systems and trained on standardized benchmarks. Experiments on five standardized hybrid benchmarks show an average energy improvement of 16.1% when applications are tuned according to our methodology, compared to 7.8% for static tuning.
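The sketch below illustrates the overall shape of such a model: a small feed-forward regressor maps hardware-counter readings plus candidate core/uncore frequencies to predicted energy, and the tuner picks the lowest-scoring setting per region. The counter set, layer sizes, and synthetic training data are assumptions, not the plugin's actual model or measurements.

```python
# Minimal sketch (not the Periscope plugin itself): a feed-forward model
# predicts energy from PMC readings plus candidate core/uncore frequencies,
# and the tuner selects the setting with the lowest prediction per region.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Features: [instructions, LLC misses, DRAM accesses, core_freq_GHz, uncore_freq_GHz]
X = rng.random((500, 5))
# Synthetic energy label standing in for measurements from training benchmarks.
y = 2.0 * X[:, 2] + 0.5 * X[:, 3] ** 2 + 0.3 * X[:, 4] + 0.1 * rng.random(500)

model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000,
                     random_state=0).fit(X, y)

def best_setting(region_counters, core_freqs, uncore_freqs):
    """Exhaustively score every (core, uncore) frequency pair for one region."""
    candidates = [(c, u) for c in core_freqs for u in uncore_freqs]
    feats = np.array([list(region_counters) + [c, u] for c, u in candidates])
    return candidates[int(np.argmin(model.predict(feats)))]

print(best_setting((0.8, 0.2, 0.6),
                   core_freqs=[1.2, 1.8, 2.4],
                   uncore_freqs=[1.0, 1.6, 2.2]))
```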
{"title":"Modelling DVFS and UFS for Region-Based Energy Aware Tuning of HPC Applications","authors":"Mohak Chadha, M. Gerndt","doi":"10.1109/IPDPS.2019.00089","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00089","url":null,"abstract":"Energy efficiency and energy conservation are one of the most crucial constraints for meeting the 20MW power envelope desired for exascale systems. Towards this, most of the research in this area has been focused on the utilization of user-controllable hardware switches such as per-core dynamic voltage frequency scaling (DVFS) and software controlled clock modulation at the application level. In this paper, we present a tuning plugin for the Periscope Tuning Framework which integrates fine-grained autotuning at the region level with DVFS and uncore frequency scaling (UFS). The tuning is based on a feed-forward neural network which is formulated using Performance Monitoring Counters (PMC) supported by x86 systems and trained using standardized benchmarks. Experiments on five standardized hybrid benchmarks show an energy improvement of 16.1% on average when the applications are tuned according to our methodology as compared to 7.8% for static tuning.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123478502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing the Parity Check Matrix for Efficient Decoding of RS-Based Cloud Storage Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00063
Junqing Gu, Chentao Wu, Xin Xie, Han Qiu, Jie Li, M. Guo, Xubin He, Yuanyuan Dong, Yafei Zhao
In large-scale distributed systems such as cloud storage systems, erasure coding is a fundamental technique for providing high reliability at low monetary cost. Compared with traditional disk arrays, cloud storage systems use erasure coding schemes with both flexible fault tolerance and high scalability, so Reed-Solomon (RS) codes and RS-based codes are popular choices. However, the decoding performance of RS-based codes is not as good as that of XOR-based codes, which are optimized by investigating the relationships among different parity chains or by reducing the computational complexity of matrix multiplications. Therefore, an efficient decoding method is highly desirable. To address this problem, we propose an Advanced Parity-Check Matrix (APCM) based approach, which extends the original Parity-Check Matrix (PCM) based approach. Instead of improving the decoding performance of XOR-based codes as PCM does, APCM focuses on optimizing the decoding efficiency of RS-based codes. Furthermore, APCM avoids matrix inversion and reduces the computational complexity of the decoding process. To demonstrate the effectiveness of APCM, we conduct intensive experiments using both RS-based and XOR-based codes in a cloud storage environment. The results show that, compared to typical decoding methods, APCM improves decoding speed by up to 32.31% in the Alibaba cloud storage system.
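As a toy illustration of the parity-check-matrix view of erasure decoding, the sketch below recovers erased symbols by solving H · c = 0 restricted to the erased columns, using GF(2) (plain XOR) arithmetic. Real RS-based codes operate over GF(2^8) and APCM's construction is not shown; only the linear-algebra structure that PCM-style decoders exploit is illustrated, and the (5,3) toy code is an assumption.

```python
# Toy parity-check-matrix decoding over GF(2): erased symbols are recovered
# by solving H . c = 0 restricted to the erased columns. This shows the
# structure PCM/APCM-style decoders exploit, not RS arithmetic over GF(2^8).
import numpy as np

def recover(H, codeword, erased):
    """H: parity-check matrix over GF(2); codeword: list with None at the
    erased positions; erased: indices of the lost symbols."""
    known = [i for i in range(len(codeword)) if i not in erased]
    # H_e * x = H_k * c_k (mod 2): move the known columns to the right side.
    rhs = H[:, known] @ np.array([codeword[i] for i in known]) % 2
    A = np.concatenate([H[:, erased], rhs.reshape(-1, 1)], axis=1) % 2
    rows = A.shape[0]
    r = 0
    for c in range(len(erased)):          # Gauss-Jordan elimination mod 2
        pivot = next((i for i in range(r, rows) if A[i, c]), None)
        if pivot is None:
            raise ValueError("erasure pattern not decodable")
        A[[r, pivot]] = A[[pivot, r]]
        for i in range(rows):
            if i != r and A[i, c]:
                A[i] = (A[i] + A[r]) % 2
        r += 1
    out = list(codeword)
    for idx, val in zip(erased, A[:len(erased), -1]):
        out[idx] = int(val)
    return out

# (5,3) toy code with two parity symbols: p0 = d0^d1^d2 and p1 = d0^d2.
H = np.array([[1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1]])
print(recover(H, [1, None, None, 0, 0], erased=[1, 2]))  # -> [1, 0, 1, 0, 0]
```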
{"title":"Optimizing the Parity Check Matrix for Efficient Decoding of RS-Based Cloud Storage Systems","authors":"Junqing Gu, Chentao Wu, Xin Xie, Han Qiu, Jie Li, M. Guo, Xubin He, Yuanyuan Dong, Yafei Zhao","doi":"10.1109/IPDPS.2019.00063","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00063","url":null,"abstract":"In large scale distributed systems such as cloud storage systems, erasure coding is a fundamental technique to provide high reliability at low monetary cost. Compared with the traditional disk arrays, cloud storage systems use an erasure coding scheme with both flexible fault tolerance and high scalability. Thus, Reed-Solomon (RS) Codes or RS-based codes are popular choices for cloud storage systems. However, the decoding performance for RS-based codes is not as good as XOR-based codes, which are optimized via investigating the relationships among different parity chains or reducing the computational complexity of matrix multiplications. Therefore, exploring an efficient decoding method is highly desired. To address the above problem, in this paper, we propose an Advanced Parity-Check Matrix (APCM) based approach, which is extended from the original Parity-Check Matrix based (PCM) approach. Instead of improving the decoding performance of XOR-based codes in PCM, APCM focuses on optimizing the decoding efficiency for RS-based codes. Furthermore, APCM avoids the matrix inversion computations and reduces the computational complexity of the decoding process. To demonstrate the effectiveness of the APCM, we conduct intensive experiments by using both RS-based and XOR-based codes under cloud storage environment. The results show that, compared to typical decoding methods, APCM improves the decoding speed by up to 32.31% in the Alibaba cloud storage system.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120902982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GraphTinker: A High Performance Data Structure for Dynamic Graph Processing
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00110
Wole Jaiyeoba, K. Skadron
Interest in high-performance analytics for dynamic (constantly evolving) graphs has been on the rise in the last decade, especially due to the prevalence and rapid growth of social networks. The current state-of-the-art data structures for dynamic graph processing rely on the adjacency-list model of edge blocks for updating graphs. This model suffers from long probe distances when following edges, leading to poor update throughput. Furthermore, both current graph processing models have drawbacks: the static model, which reprocesses the entire graph after every batch update, and the incremental model, in which only the affected subset of edges is processed. In this paper, we present GraphTinker, a new, more scalable graph data structure for dynamic graphs. It uses a new hashing scheme to reduce probe distance and improve edge-update performance, and it compacts edge data more effectively. These innovations improve performance for graph updates as well as graph analytics. In addition, we present a hybrid engine which improves the performance of dynamic graph processing by automatically selecting the better execution model (static vs. incremental) for every iteration, surpassing the performance of both. Our evaluation of GraphTinker shows a throughput improvement of up to 3.3X over the state-of-the-art data structure (STINGER) for graph updates. GraphTinker also demonstrates a performance improvement of up to 10X over STINGER when running graph analytics algorithms. In addition, our hybrid engine demonstrates up to 2X improvement over the incremental-compute model and up to 3X improvement over the static model.
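To convey the general idea of hashed edge storage, the sketch below gives each vertex a small open-addressing table of edge slots, so inserting or deleting an edge probes an expected O(1) number of slots instead of walking a chain of edge blocks. This is an illustration of the technique under assumed details (linear probing, tombstones, doubling growth), not GraphTinker's actual memory layout.

```python
# Sketch of hashed per-vertex edge storage for a dynamic graph. Each vertex
# owns an open-addressing table of (dst, weight) slots; updates probe a few
# slots rather than traversing adjacency-list edge blocks.
EMPTY, TOMBSTONE = object(), object()

class HashedAdjacency:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, dst):
        cap = len(self.slots)
        i = hash(dst) % cap
        for _ in range(cap):              # linear probing
            yield i
            i = (i + 1) % cap

    def insert(self, dst, weight):
        # Keep the table at most half full (a real implementation would
        # track the occupancy count instead of recomputing it).
        if sum(s not in (EMPTY, TOMBSTONE) for s in self.slots) * 2 >= len(self.slots):
            self._grow()
        for i in self._probe(dst):
            s = self.slots[i]
            if s in (EMPTY, TOMBSTONE) or s[0] == dst:
                self.slots[i] = (dst, weight)   # insert or update in place
                return

    def delete(self, dst):
        for i in self._probe(dst):
            s = self.slots[i]
            if s is EMPTY:
                return False
            if s is not TOMBSTONE and s[0] == dst:
                self.slots[i] = TOMBSTONE
                return True
        return False

    def _grow(self):
        live = [s for s in self.slots if s not in (EMPTY, TOMBSTONE)]
        self.slots = [EMPTY] * (2 * len(self.slots))
        for dst, w in live:
            self.insert(dst, w)

graph = {}                                # vertex id -> its hashed edge table
graph.setdefault(0, HashedAdjacency()).insert(42, weight=1.0)
print(graph[0].delete(42))                # True
```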
{"title":"GraphTinker: A High Performance Data Structure for Dynamic Graph Processing","authors":"Wole Jaiyeoba, K. Skadron","doi":"10.1109/IPDPS.2019.00110","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00110","url":null,"abstract":"Interest in high performance analytics for dynamic (constantly evolving) graphs has been on the rise in the last decade, especially due to the prevalence and rapid growth of social networks today. The current state-of-the art data structures for dynamic graph processing rely on the adjacency list model of edgeblocks in updating graphs. This model suffers from long probe distances when following edges, leading to poor update throughputs. Furthermore, both current graph processing models—the static model that requires reprocessing the entire graph after every batch update, and the incremental model in which only the affected subset of edges need to be processed—suffer drawbacks. In this paper, we present GraphTinker, a new, more scalable graph data structure for dynamic graphs. It uses a new hashing scheme to reduce probe distance and improve edge-update performance. It also better compacts edge data. These innovations improve performance for graph updates as well as graph analytics. In addition, we present a hybrid engine which improves the performance of dynamic graph processing by automatically selecting the most optimal execution model (static vs. incremental) for every iteration, surpassing the performance of both. Our evaluations of GraphTinker shows a throughput improvement of up to 3.3X compared to the state-of-the-art data structure (STINGER) when used for graph updates. GraphTinker also demonstrates a performance improvement of up to 10X over STINGER when used to run graph analytics algorithms. In addition, our hybrid engine demonstrates up to 2X improvement over the incremental-compute model and up to 3X improvement over the static model.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126762861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computation of Matrix Chain Products on Parallel Machines
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00059
Elad Weiss, O. Schwartz
The Matrix Chain Ordering Problem is a well-studied optimization problem that aims at finding the parenthesization minimizing the number of arithmetic operations required to compute a chain of matrix multiplications. Existing algorithms include the O(N^3) dynamic programming algorithm of Godbole (1973) and the faster O(N log N) algorithm of Hu and Shing (1982). We show that both may produce sub-optimal parenthesizations on modern machines, as they do not take into account inter-processor communication costs, which often dominate the running time. Further, the optimal solution may change when fast matrix multiplication algorithms are used. We adapt the O(N^3) dynamic programming algorithm to provide optimal solutions for modern machines and modern matrix multiplication algorithms, and obtain an adaptation of the O(N log N) algorithm that guarantees a constant-factor approximation.
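For reference, the classical O(N^3) dynamic program of Godbole is sketched below with the standard scalar-multiplication cost dims[i-1]*dims[k]*dims[j]. The paper's contribution is to replace this cost term with one that also accounts for inter-processor communication; that communication-aware cost model is not shown here.

```python
# Classical O(N^3) dynamic program for the Matrix Chain Ordering Problem
# (Godbole, 1973). Matrix i has shape dims[i-1] x dims[i]; splitting the
# sub-chain (i..j) at k costs dims[i-1]*dims[k]*dims[j] scalar multiplications.
def matrix_chain_order(dims):
    n = len(dims) - 1                       # number of matrices in the chain
    INF = float("inf")
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    split = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):          # length of the sub-chain
        for i in range(1, n - length + 2):
            j = i + length - 1
            cost[i][j] = INF
            for k in range(i, j):
                c = cost[i][k] + cost[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                if c < cost[i][j]:
                    cost[i][j], split[i][j] = c, k
    return cost[1][n], split

def parenthesize(split, i, j):
    if i == j:
        return f"A{i}"
    k = split[i][j]
    return f"({parenthesize(split, i, k)} {parenthesize(split, k + 1, j)})"

ops, split = matrix_chain_order([30, 35, 15, 5, 10, 20, 25])
print(ops, parenthesize(split, 1, 6))       # 15125 ((A1 (A2 A3)) ((A4 A5) A6))
```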
{"title":"Computation of Matrix Chain Products on Parallel Machines","authors":"Elad Weiss, O. Schwartz","doi":"10.1109/IPDPS.2019.00059","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00059","url":null,"abstract":"The Matrix Chain Ordering Problem is a well studied optimization problem, aiming at finding optimal parentheses assignment for minimizing the number of arithmetic operations required when computing a chain of matrix multiplications. Existing algorithms include the O(N^3) dynamic programming of Godbole (1973) and the faster O(NlogN) algorithm of Hu and Shing (1982). We show that both may result in sub-optimal parentheses assignment for modern machines as they do not take into account inter-processor communication costs that often dominate the running time. Further, the optimal solution may change when using fast matrix multiplication algorithms. We adapt the O(N^3) dynamic programming algorithm to provide optimal solutions for modern machines and modern matrix multiplication algorithms, and obtain an adaption of the O(NlogN) algorithm that guarantees a constant approximation.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126318901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IPDPS 2019 Technical Program
Pub Date: 2019-05-01 | DOI: 10.1109/ipdps.2019.00008
Vinod E. F. Rebello, Lawrence Rauchwerger
Abstract: In 2001, as early high-speed networks were deployed, George Gilder observed that "when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances." Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like "where should I compute," "for what workloads should I design computers," and "where should I place my computers" seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.

Abstract: Parallel computers have come of age and need parallel software to justify their usefulness. There are two major avenues to get programs to run in parallel: parallelizing compilers and parallel languages and/or libraries. In this talk we present our latest results using both approaches and draw some conclusions about their relative effectiveness and potential. In the first part we introduce the Hybrid Analysis (HA) compiler framework, which seamlessly integrates static and run-time analysis of memory references into a single framework capable of fully automatic loop-level parallelization. Experimental results on 26 benchmarks show full-program speedups superior to those obtained by the Intel Fortran compilers. In the second part of this talk we present the Standard Template Adaptive Parallel Library (STAPL) based approach to parallelizing code. STAPL is a collection of generic data structures and algorithms that provides a high-productivity parallel programming infrastructure analogous to the C++ Standard Template Library (STL). We provide an overview of the major STAPL components with particular emphasis on graph algorithms, and then present scalability results of real codes on petascale machines such as IBM BG/Q and Cray. Finally, we present some of our ideas for future work in this area.

Abstract: The trends in hardware architecture are paving the road towards exascale. However, these trends are also increasing the complexity of design and development of the software developer environment that is deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems create a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability, while not l
{"title":"IPDPS 2019 Technical Program","authors":"Vinod E. F. Rebello, Lawrence Rauchwerger","doi":"10.1109/ipdps.2019.00008","DOIUrl":"https://doi.org/10.1109/ipdps.2019.00008","url":null,"abstract":": In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and \"where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum. Abstract: Parallel computers have come of age and need parallel software to justify their usefulness. There are two major avenues to get programs to run in parallel: parallelizing compilers and parallel languages and/or libraries. In this talk we present our latest results using both approaches and draw some conclusions about their relative effectiveness and potential. In the first part we introduce the Hybrid Analysis (HA) compiler framework that can seamlessly integrate static and run-time analysis of memory references into a single framework capable of full automatic loop level parallelization. Experimental results on 26 benchmarks show full program speedups superior to those obtained by the Intel Fortran compilers. In the second part of this talk we present the Standard Template Adaptive Parallel Library (STAPL) based approach to parallelizing code. STAPL is a collection of generic data structures and algorithms that provides a high productivity, parallel programming infrastructure analogous to the C++ Standard Template Library (STL). In this talk, we provide an overview of the major STAPL components with particular emphasis on graph algorithms. We then present scalability results of real codes using peta scale machines such as IBM BG/Q and Cray. Finally we present some of our ideas for future work in this area. Abstract: The trends in hardware architecture are paving the road towards Exascale. However, these trends are also increasing the complexity of design and development of the software developer environment that is deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems creates a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. 
In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability, while not l","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123541068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design Space Exploration of Next-Generation HPC Machines
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00017
Constantino Gómez, Francesc Martínez, Adrià Armejach, Miquel Moretó, F. Mantovani, Marc Casas
The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. To improve the efficiency of next-generation large HPC systems, designers require tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. We simulate five hybrid (MPI+OpenMP) applications over 864 architectural proposals based on state-of-the-art and emerging HPC technologies, relevant both in industry and research. This paper significantly extends our previous work on the MUltiscale Simulation Approach (MUSA), enabling accurate performance and power estimations of large-scale HPC systems. We reveal that several applications present critical scalability issues, mostly due to the software parallelization approach. By examining speedup and energy consumption while exploring the design space (i.e., varying memory bandwidth, number of cores, and type of cores), we provide evidence-based architectural recommendations that can serve as hardware and software co-design guidelines.
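A design space like the 864 configurations above is simply the cross product of a few architectural knobs; the tiny sketch below shows how such a grid can be enumerated before each point is handed to a simulator. The parameter names and values are placeholders, not the grid actually simulated with MUSA.

```python
# Sketch of enumerating a design space as a cross product of architectural
# knobs. The values below are illustrative placeholders, not the paper's grid.
from itertools import product

design_space = {
    "core_type":          ["out-of-order", "in-order", "wide-ooo"],
    "cores_per_node":     [16, 32, 64, 128],
    "mem_bandwidth_GBps": [100, 200, 400],
}

configs = [dict(zip(design_space, combo)) for combo in product(*design_space.values())]
print(len(configs), configs[0])   # each config would be fed to the simulator
```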
{"title":"Design Space Exploration of Next-Generation HPC Machines","authors":"Constantino Gómez, Francesc Martínez, Adrià Armejach, Miquel Moretó, F. Mantovani, Marc Casas","doi":"10.1109/IPDPS.2019.00017","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00017","url":null,"abstract":"The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. With the goal of improving the efficiency of next-generation large HPC systems, designers require tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. We simulate five hybrid (MPI+OpenMP) applications over 864 architectural proposals based on state-of-the-art and emerging HPC technologies, relevant both in industry and research. This paper significantly extends our previous work with MUltiscale Simulation Approach (MUSA) enabling accurate performance and power estimations of large-scale HPC systems. We reveal that several applications present critical scalability issues mostly due to the software parallelization approach. Looking at speedup and energy consumption exploring the design space (i.e., changing memory bandwidth, number of cores, and type of cores), we provide evidence-based architectural recommendations that will serve as hardware and software co-design guidelines.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130266150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DYRS: Bandwidth-Aware Disk-to-Memory Migration of Cold Data in Big-Data File Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00069
Simbarashe Dzinamarira, Florin Dinu, T. Ng
Migrating data into memory can significantly accelerate big-data applications by hiding low disk throughput. While prior work has mostly targeted caching of frequently used data, the techniques employed do not benefit jobs that read cold data. For these jobs, the file system has to proactively migrate the inputs into memory. Successfully migrating cold inputs can yield a large speedup for many jobs, especially those that spend a significant part of their execution reading inputs. In this paper, we use data from the Google cluster trace to make the case that the conditions in production workloads are favorable for migration. We then design and implement DYRS, a framework for migrating cold data in big-data file systems. DYRS adapts to the available bandwidth on storage nodes, ensuring that all nodes are fully utilized throughout the migration. In addition to balancing the load, DYRS optimizes the placement of each migration to maximize the number of successful migrations and eliminate stragglers at the end of a job. We evaluate DYRS using several Hive queries, a trace-based workload from Facebook, and the Sort application. Our results show that DYRS successfully adapts to bandwidth heterogeneity and effectively migrates data. DYRS accelerates Hive queries by up to 48%, and by 36% on average. Jobs in the trace-based workload experience a speedup of 33% on average, and the mapper tasks in this workload see an even greater speedup of 46%. DYRS accelerates sort jobs by up to 20%.
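The sketch below illustrates the general flavor of bandwidth-aware placement: each block to prefetch is assigned to the replica-holding node that would finish the transfer earliest given its current queue, so heterogeneous nodes stay evenly loaded. Node names, bandwidths, and the largest-block-first heuristic are illustrative assumptions, not DYRS's exact algorithm.

```python
# Minimal sketch of bandwidth-aware placement of cold-data migrations.
# blocks: list of (block_id, size_MB, replica_nodes);
# node_bandwidth_MBps: available read bandwidth per storage node.
def plan_migrations(blocks, node_bandwidth_MBps):
    # Seconds of transfer already assigned to each node.
    busy = {n: 0.0 for n in node_bandwidth_MBps}
    plan = []
    for block_id, size, replicas in sorted(blocks, key=lambda b: -b[1]):
        # Pick the replica node that would finish this transfer earliest.
        node = min(replicas, key=lambda n: busy[n] + size / node_bandwidth_MBps[n])
        busy[node] += size / node_bandwidth_MBps[node]
        plan.append((block_id, node))
    return plan, busy

blocks = [("b1", 256, ["n1", "n2"]),
          ("b2", 128, ["n2", "n3"]),
          ("b3", 256, ["n1", "n3"])]
print(plan_migrations(blocks, {"n1": 100.0, "n2": 200.0, "n3": 50.0}))
```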
{"title":"DYRS: Bandwidth-Aware Disk-to-Memory Migration of Cold Data in Big-Data File Systems","authors":"Simbarashe Dzinamarira, Florin Dinu, T. Ng","doi":"10.1109/IPDPS.2019.00069","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00069","url":null,"abstract":"Migrating data into memory can significantly accelerate big-data applications by hiding low disk throughput. While prior work has mostly targeted caching frequently used data, the techniques employed do not benefit jobs that read cold data. For these jobs, the file system has to pro-actively migrate the inputs into memory. Successfully migrating cold inputs can result in a large speedup for many jobs, especially those that spend a significant part of their execution reading inputs. In this paper, we use data from the Google cluster trace to make the case that the conditions in production workloads are favorable for migration. We then design and implement DYRS, a framework for migrating cold data in big-data file systems. DYRS can adapt to match the available bandwidth on storage nodes, ensuring all nodes are fully utilized throughout the migration. In addition to balancing the load, DYRS optimizes the placement of each migration to maximize the number of successful migrations and eliminate stragglers at the end of a job. We evaluate DYRS using several Hive queries, a trace-based workload from Facebook, and the Sort application. Our results show that DYRS successfully adapts to bandwidth heterogeneity and effectively migrates data. DYRS accelerates Hive queries by up to 48%, and by 36% on average. Jobs in a trace-based workload experience a speedup of 33% on average. The mapper tasks in this workload have an even greater speedup of 46%. DYRS accelerates sort jobs by up to 20%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115784112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tight & Simple Load Balancing
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00080
P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling
We consider the following load balancing process for m tokens distributed arbitrarily among n nodes connected by a complete graph. In each time step a pair of nodes is selected uniformly at random. Let ℓ_1 and ℓ_2 be their respective numbers of tokens. The two nodes exchange tokens such that they have ⌈(ℓ_1 + ℓ_2)/2⌉ and ⌊(ℓ_1 + ℓ_2)/2⌋ tokens, respectively. We provide a simple analysis showing that this process reaches almost perfect balance within O(n log n + n log Δ) steps with high probability, where Δ is the maximal initial load difference between any two nodes. This bound is asymptotically tight.
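The process is simple enough to simulate directly; the sketch below repeatedly picks a random pair of nodes, rebalances their tokens to the ceiling and floor of the average, and counts exchanges until the maximum load difference is at most 1. The initial load vector and stopping criterion are illustrative choices, not part of the paper's analysis.

```python
# Simulation sketch of the analyzed process: pick a uniform random pair of
# nodes, rebalance to ceil/floor of the pair's average, stop once the loads
# are almost perfectly balanced (max - min <= 1).
import random

def balance(loads, rng=random.Random(0)):
    n = len(loads)
    steps = 0
    while max(loads) - min(loads) > 1:
        a, b = rng.randrange(n), rng.randrange(n)
        if a == b:
            continue
        total = loads[a] + loads[b]
        loads[a], loads[b] = (total + 1) // 2, total // 2
        steps += 1
    return steps

loads = [0] * 99 + [10_000]          # initial load difference Delta = 10000
print(balance(loads), "pair exchanges until max - min <= 1")
```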
{"title":"Tight & Simple Load Balancing","authors":"P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling","doi":"10.1109/IPDPS.2019.00080","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00080","url":null,"abstract":"We consider the following load balancing process for m tokens distributed arbitrarily among n nodes connected by a complete graph. In each time step a pair of nodes is selected uniformly at random. Let ℓ_1 and ℓ_2 be their respective number of tokens. The two nodes exchange tokens such that they have ⌈(ℓ_1 + ℓ_2)/2⌉ and ⌈(ℓ_1 + ℓ_2)/2⌉ tokens, respectively. We provide a simple analysis showing that this process reaches almost perfect balance within O(n log n + n log Δ) steps with high probability, where Δ is the maximal initial load difference between any two nodes. This bound is asymptotically tight.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126936332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}