
Latest publications from the 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL
Sabela Ramos, T. Hoefler
Increasingly complex memory systems and on-chip interconnects are being developed to mitigate the data movement bottlenecks in manycore processors. One example of such a complex system is the Xeon Phi KNL CPU with three different types of memory, fifteen memory configuration options, and a complex on-chip mesh network connecting up to 72 cores. Users require a detailed understanding of the performance characteristics of the different options to utilize the system efficiently. Unfortunately, peak performance is rarely achievable and achievable performance is hardly documented. We address this with capability models of the memory subsystem, derived by systematic measurements, to guide users to navigate the complex optimization space. As a case study, we provide an extensive model of all memory configuration options for Xeon Phi KNL. We demonstrate how our capability model can be used to automatically derive new close-to-optimal algorithms for various communication functions, yielding improvements of 5x and 24x over Intel’s tuned OpenMP and MPI implementations, respectively. Furthermore, we demonstrate how to use the models to assess how efficiently a bitonic sort application utilizes the memory resources. Interestingly, our capability models predict and explain that the high-bandwidth MCDRAM does not improve the bitonic sort performance over DRAM.
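For readers unfamiliar with the application in the case study: a bitonic sort recursively builds a bitonic sequence and merges it with a fixed compare-exchange network, which is what makes it bandwidth-bound and attractive for parallel hardware. A minimal sequential sketch (illustrative only, not the authors' KNL implementation, and assuming a power-of-two input length) is:

```python
def bitonic_sort(values, ascending=True):
    """Sort a power-of-two-length list with a bitonic sorting network."""
    n = len(values)
    if n <= 1:
        return list(values)
    # Build a bitonic sequence: first half ascending, second half descending.
    first = bitonic_sort(values[: n // 2], True)
    second = bitonic_sort(values[n // 2 :], False)
    return _bitonic_merge(first + second, ascending)

def _bitonic_merge(values, ascending):
    """Compare-exchange halves of a bitonic sequence, then recurse."""
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    for i in range(half):
        if (values[i] > values[i + half]) == ascending:
            values[i], values[i + half] = values[i + half], values[i]
    return _bitonic_merge(values[:half], ascending) + _bitonic_merge(values[half:], ascending)
```

The fixed data-access pattern of the compare-exchange stages is why the sort's performance is governed by memory bandwidth rather than branching.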
Citations: 45
Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework
Pub Date: 2017-06-30 DOI: 10.1109/IPDPS.2017.107
Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, M. Ueda, Tapasya Patki, D. Ellsworth, B. Rountree, M. Schulz
Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installing more hardware resources than can be fully powered under a given power limit, coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated the viability of this approach, but could only rely on small-scale simulations of the software stack. While such research is useful to understand the boundaries of performance benefits that can be achieved, it does not cover any deployment or operational concerns of using overprovisioning on production systems. This paper is the first to present an extensible power-aware resource management framework for production-sized overprovisioned systems based on the widely established SLURM resource manager. Our framework provides flexible plugin interfaces and APIs for power management that can be easily extended to implement site-specific strategies and for comparison of different power management techniques. We demonstrate our framework on a 965-node HA8000 production system at Kyushu University. Our results indicate that it is indeed possible to safely overprovision hardware in production. We also find that the power consumption of idle nodes, which depends on the degree of overprovisioning, can become a bottleneck. Using real-world data, we then draw conclusions about the impact of the total number of nodes provided in an overprovisioned environment.
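The power-steering idea described above can be pictured with a toy greedy allocator (a hypothetical sketch, not SLURM's or the paper's plugin API): give every job its minimum power state, then hand out whatever remains of the cluster cap to jobs that can use more.

```python
def steer_power(jobs, power_cap):
    """Greedily split a cluster power cap across jobs.

    jobs: {name: (min_watts, max_watts)} per job.
    Returns {name: watts}, or None if the cap cannot cover the minimums.
    """
    # Every job first gets the minimum it needs to run at all.
    alloc = {name: lo for name, (lo, hi) in jobs.items()}
    budget = power_cap - sum(alloc.values())
    if budget < 0:
        return None  # cap too low even for minimum power states
    # Steer the leftover budget to jobs that can absorb more power.
    for name, (lo, hi) in sorted(jobs.items()):
        extra = min(budget, hi - lo)
        alloc[name] += extra
        budget -= extra
    return alloc
```

A production framework would replace the fixed (min, max) pairs with measured power profiles and re-run this steering step as jobs start and finish.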
Citations: 30
Toucan — A Translator for Communication Tolerant MPI Applications
Sergio M. Martin, M. Berger, S. Baden
We discuss early results with Toucan, a source-to-source translator that automatically restructures C/C++ MPI applications to overlap communication with computation. We co-designed the translator and runtime system to enable dynamic, dependence-driven execution of MPI applications, and require only a modest amount of programmer annotation. Co-design was essential to realizing overlap through dynamic code block reordering and avoiding the limitations of static code relocation and inlining. We demonstrate that Toucan hides significant communication in four representative applications running on up to 24K cores of NERSC's Edison platform. Using Toucan, we have hidden from 33% to 85% of the communication overhead, with performance meeting or exceeding that of painstakingly hand-written overlap variants.
Citations: 7
Approximation Proofs of a Fast and Efficient List Scheduling Algorithm for Task-Based Runtime Systems on Multicores and GPUs
Olivier Beaumont, Lionel Eyraud-Dubois, Suraj Kumar
In High Performance Computing, heterogeneity is now the norm, with specialized accelerators like GPUs providing efficient computational power. The added complexity has led to the development of task-based runtime systems, which allow complex computations to be expressed as task graphs, and rely on scheduling algorithms to perform load balancing between all resources of the platforms. Developing good scheduling algorithms, even on a single node, and analyzing them can thus have a very high impact on the performance of current HPC systems. The special case of two types of resources (namely CPUs and GPUs) is of practical interest. HeteroPrio is such an algorithm, which was proposed in the context of fast multipole computations and then extended to general task graphs with very interesting results. In this paper, we provide a theoretical insight on the performance of HeteroPrio, by proving approximation bounds compared to the optimal schedule in the case where all tasks are independent and for different platform sizes. Interestingly, this shows that spoliation makes it possible to prove approximation ratios for a list scheduling algorithm on two unrelated resources, which is not possible otherwise. We also establish that almost all our bounds are tight. Additionally, we provide an experimental evaluation of HeteroPrio on real task graphs from dense linear algebra computation, which highlights the reasons explaining its good practical performance.
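For the independent-tasks case analyzed in the paper, the HeteroPrio ordering can be sketched as follows: sort tasks by their GPU acceleration ratio, let CPUs take work from the least-accelerated end and GPUs from the most-accelerated end. This is a simplified sketch without the spoliation mechanism the proofs rely on:

```python
def heteroprio_makespan(tasks, n_cpu, n_gpu):
    """Schedule independent tasks on n_cpu CPUs and n_gpu GPUs with the
    HeteroPrio ordering and return the makespan.

    tasks: list of (cpu_time, gpu_time), both positive.
    """
    # Ascending GPU acceleration: front = CPU-friendly, back = GPU-friendly.
    order = sorted(range(len(tasks)), key=lambda i: tasks[i][0] / tasks[i][1])
    cpu_free = [0.0] * n_cpu   # time at which each CPU becomes free
    gpu_free = [0.0] * n_gpu
    lo, hi = 0, len(tasks) - 1
    while lo <= hi:
        # The resource that becomes free first grabs its preferred task.
        if min(cpu_free) <= min(gpu_free):
            i = cpu_free.index(min(cpu_free))
            cpu_free[i] += tasks[order[lo]][0]   # least-accelerated task
            lo += 1
        else:
            i = gpu_free.index(min(gpu_free))
            gpu_free[i] += tasks[order[hi]][1]   # most-accelerated task
            hi -= 1
    return max(cpu_free + gpu_free)
```

Spoliation, omitted here, lets an idle fast resource re-execute a task already started on a slow one, which is the ingredient that makes bounded approximation ratios provable.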
Citations: 21
Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation
Mathieu Faverge, J. Langou, Y. Robert, J. Dongarra
We study tiled algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations: (i) the tiled bidiagonalization algorithm BIDIAG, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BIDIAG, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. For both BIDIAG and R-BIDIAG, we use four main types of reduction trees, namely FLATTS, FLATTT, GREEDY, and a newly introduced auto-adaptive tree, AUTO. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BIDIAG has a shorter critical path length than BIDIAG for tall and skinny matrices, and (ii) GREEDY based schemes are much better than earlier proposed algorithms with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.
Citations: 3
Dynamic Memory-Aware Task-Tree Scheduling
G. Aupy, Clement Brasseur, L. Marchal
Factorizing sparse matrices using direct multifrontal methods generates directed tree-shaped task graphs, where edges represent data dependency between tasks. This paper revisits the execution of tree-shaped task graphs using multiple processors that share a bounded memory. A task can only be executed if all its input and output data can fit into the memory. The key difficulty is to manage the order of the task executions so that we can achieve high parallelism while staying below the memory bound. In particular, because input data of unprocessed tasks must be kept in memory, a bad scheduling strategy might compromise the termination of the algorithm. In the single processor case, solutions that are guaranteed to be below a memory bound are known. The multi-processor case (when one tries to minimize the total completion time) has been shown to be NP-complete. We present in this paper a novel heuristic solution that has a low complexity and is guaranteed to complete the tree within a given memory bound. We compare our algorithm to state-of-the-art strategies, and observe that on both actual execution trees and synthetic trees, we always perform better than these solutions, with average speedups between 1.25 and 1.45 on actual assembly trees. Moreover, we show that the overhead of our algorithm is negligible even on deep trees (10^5), and would allow its runtime execution.
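The single-processor baseline mentioned above is classical: Liu's ordering processes the children of each node in decreasing (peak minus output size) and minimizes the peak memory of a sequential postorder. A small sketch under a simplified memory model (processing a node requires its children's outputs plus its own output to be resident; the model is an assumption for illustration, not the paper's exact one) is:

```python
def peak_memory(tree, node=0):
    """Peak memory of the best sequential postorder of a task tree.

    tree: {node_id: (output_size, [child_ids])}; the root is node 0 by
    default. Children are processed in decreasing (peak - output),
    which is Liu's memory-optimal ordering for this model.
    """
    out, children = tree[node]
    if not children:
        return out
    child_stats = sorted(
        ((peak_memory(tree, c), tree[c][0]) for c in children),
        key=lambda pc: pc[0] - pc[1],
        reverse=True,
    )
    held = 0   # outputs of already-processed children kept in memory
    peak = 0
    for child_peak, child_out in child_stats:
        peak = max(peak, held + child_peak)
        held += child_out
    # Finally, all children's outputs plus this node's output are live.
    return max(peak, held + out)
```

The multiprocessor heuristic of the paper must additionally decide which subtrees may run concurrently without their combined footprints exceeding the shared bound.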
Citations: 12
Co-Run Scheduling with Power Cap on Integrated CPU-GPU Systems
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.124
Qingnhua Zhu, Bo Wu, Xipeng Shen, Li Shen, Zhiying Wang
This paper presents the first systematic study on co-scheduling independent jobs on integrated CPU-GPU systems with power caps considered. It reveals the performance degradations caused by the co-run contentions at the levels of both memory and power. It then examines the problem of using job co-scheduling to alleviate the degradations in this less understood scenario. It offers several algorithms and a lightweight co-run performance and power predictive model for computing the performance bounds of the optimal co-schedules and finding appropriate schedules. Results show that the method can efficiently find co-schedules that significantly improve the system throughput (9-46% on average over the default schedules).
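The co-scheduling decision can be pictured with a toy exhaustive search over job pairs under a power cap, with a degradation table standing in for the paper's lightweight co-run predictive model (all names and numbers here are hypothetical):

```python
from itertools import permutations

def best_coschedule(jobs, degradation, power_cap):
    """Pick the feasible job pair with the highest combined throughput.

    jobs: {name: (solo_throughput, watts)};
    degradation[(a, b)]: fraction of a's solo throughput retained when
    co-run with b (the stand-in for a contention model).
    Returns (throughput, (a, b)) or None if no pair fits under the cap.
    """
    best = None
    for a, b in permutations(jobs, 2):
        if a >= b:
            continue  # consider each unordered pair once
        if jobs[a][1] + jobs[b][1] > power_cap:
            continue  # pair would exceed the power cap
        tput = (jobs[a][0] * degradation[(a, b)]
                + jobs[b][0] * degradation[(b, a)])
        if best is None or tput > best[0]:
            best = (tput, (a, b))
    return best
```

The paper's contribution is precisely the predictive model that fills in the degradation entries without exhaustively co-running every pair.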
Citations: 27
Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory
Shaden Smith, Jongsoo Park, G. Karypis
HPC systems are increasingly used for data intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands. To address these challenging demands, HPC systems are turning to many-core architectures that feature a large number of energy-efficient cores backed by high-bandwidth memory. These features are exemplified in Intel's recent Knights Landing many-core processor (KNL), which typically has 68 cores and 16GB of on-package multi-channel DRAM (MCDRAM). This work investigates how the novel architectural features offered by KNL can be used in the context of decomposing sparse, unstructured tensors using the canonical polyadic decomposition (CPD). The CPD is used extensively to analyze large multi-way datasets arising in various areas including precision healthcare, cybersecurity, and e-commerce. Towards this end, we (i) develop problem decompositions for the CPD which are amenable to hundreds of concurrent threads while maintaining load balance and low synchronization costs; and (ii) explore the utilization of architectural features such as MCDRAM. Using one KNL processor, our algorithm achieves up to 1.8x speedup over a dual socket Intel Xeon system with 44 cores.
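The computational core of CPD algorithms such as alternating least squares is the MTTKRP kernel (matricized tensor times Khatri-Rao product) over the sparse tensor; it is this kernel whose irregular accesses stress the memory system. A compact NumPy sketch for a tensor in coordinate (COO) form, illustrative rather than the authors' optimized KNL code, is:

```python
import numpy as np

def mttkrp(coords, vals, factors, mode):
    """MTTKRP for a sparse tensor in COO form.

    coords: (nnz, nmodes) integer array of nonzero coordinates;
    vals: (nnz,) nonzero values;
    factors: list of (dim_m, rank) factor matrices, one per mode;
    mode: the mode whose factor is being updated.
    """
    rank = factors[0].shape[1]
    out = np.zeros((factors[mode].shape[0], rank))
    for idx, v in zip(coords, vals):
        # Hadamard product of the other modes' factor rows, scaled by v.
        row = np.full(rank, v)
        for m, i in enumerate(idx):
            if m != mode:
                row = row * factors[m][i]
        out[idx[mode]] += row   # scattered accumulation: the hard part
    return out
```

The scattered `out[idx[mode]] += row` updates are what the paper's decompositions must parallelize across hundreds of threads without races or load imbalance.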
Citations: 37
Leader Election in a Smartphone Peer-to-Peer Network
Calvin C. Newport
In this paper, we study the fundamental problem of leader election in the mobile telephone model: a recently introduced variation of the classical telephone model modified to better describe the local peer-to-peer communication services implemented in many popular smartphone operating systems. In more detail, the mobile telephone model differs from the classical telephone model in three ways: (1) each device can participate in at most one connection per round; (2) the network topology can undergo a parameterized rate of change; and (3) devices can advertise a parameterized number of bits to their neighbors in each round before connection attempts are initiated. We begin by describing and analyzing a new leader election algorithm in this model that works under the harshest possible parameter assumptions: maximum rate of topology changes and no advertising bits. We then apply this result to resolve an open question from [Ghaffari, 2016] on the efficiency of PUSH-PULL rumor spreading under these conditions. We then turn our attention to the slightly easier case where devices can advertise a single bit in each round. We demonstrate a large gap in time complexity between these zero bit and one bit cases. In more detail, we describe and analyze a new algorithm that solves leader election with a time complexity that includes the parameter bounding topology changes. For all values of this parameter, this algorithm is faster than the previous result, with a gap that grows quickly as the parameter increases (indicating lower rates of change). We conclude by describing and analyzing a modified version of this algorithm that does not require the assumption that all devices start during the same round. This new version has a similar time complexity (the rounds required differ only by a polylogarithmic factor), but now requires slightly larger advertisement tags.
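The one-connection-per-round constraint of the mobile telephone model can be pictured with a toy deterministic gossip simulation (not the paper's algorithm, which works on changing topologies without such a schedule): devices meet according to a round-robin schedule, one connection per device per round, and adopt the largest id heard; after n-1 rounds every device has met the maximum-id device and agrees on it as leader.

```python
def elect_leader(n):
    """Simulate max-id leader election under one connection per device
    per round, for an even number n of devices, using the circle-method
    round-robin schedule. Returns the final knowledge of each device."""
    heard = list(range(n))   # device i initially knows only its own id i
    ids = list(range(n))
    for _ in range(n - 1):
        half = n // 2
        # Circle method: pair the two halves of the current arrangement.
        for a, b in zip(ids[:half], reversed(ids[half:])):
            best = max(heard[a], heard[b])
            heard[a] = heard[b] = best
        # Rotate all positions except the fixed first one.
        ids = [ids[0]] + [ids[-1]] + ids[1:-1]
    return heard
```

Because the round-robin schedule makes every pair of devices meet exactly once in n-1 rounds, the maximum id provably reaches everyone; the paper's challenge is achieving comparable spreading when the matching is adversarial and the topology changes between rounds.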
This new version has a similar time complexity (the rounds required differ only by a polylogarithmic factor), but now requires slightly larger advertisement tags.
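The model's one-connection-per-round constraint can be illustrated with a toy simulation (not the paper's algorithm): each round, every still-unmatched device proposes a connection to one random neighbor, and matched pairs exchange the largest device id seen so far, so the maximum id eventually propagates and its holder considers itself leader.

```python
import random

def simulate_leader_election(adjacency, rounds):
    """Toy simulation on a static topology. Each round, every device
    proposes a connection to one random neighbor; a proposal succeeds
    only if both endpoints are still unmatched that round (at most one
    connection per device, as in the mobile telephone model). Connected
    pairs exchange the largest id either has seen."""
    known = {v: v for v in adjacency}  # largest id each device has seen
    for _ in range(rounds):
        matched = set()
        order = list(adjacency)
        random.shuffle(order)  # randomize which proposals go first
        for v in order:
            if v in matched or not adjacency[v]:
                continue
            u = random.choice(adjacency[v])
            if u not in matched:
                matched.update((u, v))
                best = max(known[u], known[v])
                known[u] = known[v] = best
    return known

# Ring of 8 devices: with enough rounds, the max id (7) spreads to all.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
result = simulate_leader_election(ring, rounds=200)
```

The simulation ignores topology churn and advertisement bits; the paper's contribution is precisely handling those parameters with provable round bounds.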
Cited by: 10
Automatic-Signal Monitors with Multi-object Synchronization
W. Hung, V. Garg
Current monitor-based systems have several disadvantages for multi-object operations. They require the programmers to (1) manually determine the order of locking operations, (2) manually determine the points of execution where threads should signal other threads, and (3) use global locks or perform busy waiting for operations that depend upon a condition that spans multiple objects. Transactional memory systems eliminate the need for explicit locks, but do not support conditional synchronization. They also require the ability to roll back transactions. In this paper, we propose new monitor-based methods that provide automatic signaling for global conditions that span multiple objects. Our system provides automatic notification for global conditions. Assuming that the global condition is a Boolean expression of local predicates, our method allows efficient monitoring of the conditions without any need for global locks. Furthermore, our system solves the monitor composition problem without requiring global locks. We have implemented our constructs on top of Java and have evaluated their overhead.
Our results show that on most of the test cases, our code is not only simpler but also faster than Java's reentrant lock as well as the Deuce transactional memory system.
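The core idea of automatic signaling can be sketched in a few lines (illustrative only; not the paper's Java implementation): instead of the programmer calling notify at hand-picked points, every state-changing operation broadcasts, and each waiter re-checks its own predicate, which may span several objects guarded by the monitor.

```python
import threading

class AutoSignalMonitor:
    """Minimal automatic-signal monitor sketch: waiters block on an
    arbitrary predicate over shared state; every update broadcasts so
    all waiters re-evaluate their predicates. No explicit notify calls
    appear in client code."""
    def __init__(self):
        self._cond = threading.Condition()

    def wait_until(self, predicate):
        with self._cond:
            self._cond.wait_for(predicate)  # re-checks after each wakeup

    def update(self, mutator):
        # Run a state change under the lock, then wake all waiters so
        # they re-evaluate their predicates -- the "automatic signal".
        with self._cond:
            mutator()
            self._cond.notify_all()

# Usage: a condition spanning two accounts, with no hand-placed signals.
monitor = AutoSignalMonitor()
accounts = {"a": 0, "b": 0}
result = []

def waiter():
    monitor.wait_until(lambda: accounts["a"] + accounts["b"] >= 100)
    result.append(accounts["a"] + accounts["b"])

t = threading.Thread(target=waiter)
t.start()
monitor.update(lambda: accounts.__setitem__("a", 60))   # sum 60: still waiting
monitor.update(lambda: accounts.__setitem__("b", 50))   # sum 110: waiter wakes
t.join()
```

Broadcasting on every update is the simplest correct policy but wakes waiters unnecessarily; the paper's contribution is doing this efficiently for Boolean combinations of local predicates, without a global lock.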
Cited by: 1
Journal
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)