Chen Shen, Xiaonan Tian, Dounia Khaldi, B. Chapman
The proliferation of accelerators in modern clusters makes efficient coprocessor programming a key requirement if application codes are to achieve high levels of performance with acceptable energy consumption on such platforms. This has led to considerable effort to provide suitable programming models for these accelerators, especially within the OpenMP community. While OpenMP 4.5 offers a rich set of directives, clauses and runtime calls to fully utilize accelerators, an efficient implementation of OpenMP 4.5 for GPUs remains a non-trivial task, given their multiple levels of thread parallelism. In this paper, we describe a new implementation of the corresponding features of OpenMP 4.5 for GPUs based on a one-to-one mapping of its loop hierarchy parallelism to the GPU thread hierarchy. We assess the impact of this mapping, in particular the use of GPU warps to handle innermost loop execution, on the performance of GPU execution via a set of benchmarks that include a version of the NAS parallel benchmarks specifically developed for this research; we also used the Matrix-Matrix multiplication, Jacobi, Gauss and Laplacian kernels.
{"title":"Assessing One-to-One Parallelism Levels Mapping for OpenMP Offloading to GPUs","authors":"Chen Shen, Xiaonan Tian, Dounia Khaldi, B. Chapman","doi":"10.1145/3026937.3026945","DOIUrl":"https://doi.org/10.1145/3026937.3026945","url":null,"abstract":"The proliferation of accelerators in modern clusters makes efficient coprocessor programming a key requirement if application codes are to achieve high levels of performance with acceptable energy consumption on such platforms. This has led to considerable effort to provide suitable programming models for these accelerators, especially within the OpenMP community. While OpenMP 4.5 offers a rich set of directives, clauses and runtime calls to fully utilize accelerators, an efficient implementation of OpenMP 4.5 for GPUs remains a non-trivial task, given their multiple levels of thread parallelism. In this paper, we describe a new implementation of the corresponding features of OpenMP 4.5 for GPUs based on a one-to-one mapping of its loop hierarchy parallelism to the GPU thread hierarchy. 
We assess the impact of this mapping, in particular the use of GPU warps to handle innermost loop execution, on the performance of GPU execution via a set of benchmarks that include a version of the NAS parallel benchmarks specifically developed for this research; we also used the Matrix-Matrix multiplication, Jacobi, Gauss and Laplacian kernels.","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134208418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional widely used parallel programming models and methods focus on data distribution and are suitable for implementing data parallelism. They lack an abstraction for task parallelism and make it inconvenient to separate an application's high-level structure from its low-level implementation and execution. To improve this, we propose a parallel programming framework based on the tasks and conduits (TNC) model. In this framework, we provide tasks and conduits as the basic components to construct applications at a higher level. Users can easily implement coarse-grained task parallelism with multiple tasks running concurrently. When running on different platforms, the application's main structure can stay the same; only the task implementations need to be adapted to the target platforms, improving the maintainability and portability of parallel programs. For a single task, we provide multiple levels of shared memory concepts, allowing users to implement fine-grained data parallelism through groups of threads across multiple nodes. This provides users a flexible and efficient means to implement parallel applications. By extending the framework's runtime system, applications can launch and run GPU tasks to make use of GPUs for acceleration. The support of both CPU tasks and GPU tasks helps users develop and run parallel applications on heterogeneous platforms. To demonstrate the use of our framework, we tested it with several kernel applications. The results show that the applications' performance using our framework is comparable to traditional programming methods. Further, with the use of GPU tasks, we can easily adjust a program to leverage GPUs for acceleration. In our tests, a single GPU's performance is comparable to that of a 4-node multicore CPU cluster.
{"title":"A Framework for Developing Parallel Applications with high level Tasks on Heterogeneous Platforms","authors":"Chao Liu, M. Leeser","doi":"10.1145/3026937.3026946","DOIUrl":"https://doi.org/10.1145/3026937.3026946","url":null,"abstract":"Traditional widely used parallel programming models and methods focus on data distribution and are suitable for implementing data parallelism. They lack the abstraction of task parallelism and make it inconvenient to separate the applications' high level structure from low level implementation and execution. To improve this, we propose a parallel programming framework based on the tasks and conduits (TNC) model. In this framework, we provide tasks and conduits as the basic components to construct applications at a higher level. Users can easily implement coarse-grained task parallelism with multiple tasks running concurrently. When running on different platforms, the application main structure can stay the same and only adapt task implementations based on the target platforms, improving maintenance and portability of parallel programs. For a single task, we provide multiple levels of shared memory concepts, allowing users to implement fine grained data parallelism through groups of threads across multiple nodes. This provides users a flexible and efficient means to implement parallel applications. By extending the framework runtime system, it is able to launch and run GPU tasks to make use of GPUs for acceleration. The support of both CPU tasks and GPU tasks helps users develop and run parallel applications on heterogeneous platforms. To demonstrate the use of our framework, we tested it with some kernel applications. The results show that the applications' performance using our framework is comparable to traditional programming methods. Further, with the use of GPU tasks, we can easily adjust the program to leverage GPUs for acceleration. 
In our tests, a single GPU's performance is comparable to a 4 node multicore CPU cluster.","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133029204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Alonso, Sandra Catalán, J. Herrero, E. S. Quintana‐Ortí, Rafael Rodríguez-Sánchez
Asymmetric multicore processors (AMPs), such as those present in ARM big.LITTLE technology, have been proposed as a means to address the end of Dennard power scaling. The idea of these architectures is to activate only the type (and number) of cores that satisfy the quality of service requested by the application(s) in execution while delivering high energy efficiency. For dense linear algebra problems, though, performance is of paramount importance, asking for an efficient use of all computational resources in the AMP. In response to this, we investigate how to exploit the asymmetric cores of an ARMv7 big.LITTLE AMP in order to attain high performance for the reduction to tridiagonal form, an essential step towards the solution of dense symmetric eigenvalue problems. The routine for this purpose in LAPACK is especially challenging, since half of its floating-point arithmetic operations (flops) are cast in terms of compute-bound kernels while the remaining half correspond to memory-bound kernels. To deal with this scenario: 1) we leverage a tuned implementation of the compute-bound kernels for AMPs; 2) we develop and parallelize new architecture-aware micro-kernels for the memory-bound kernels; and 3) we carefully adjust the type and number of cores to use at each step of the reduction procedure.
{"title":"Reduction to Tridiagonal Form for Symmetric Eigenproblems on Asymmetric Multicore Processors","authors":"P. Alonso, Sandra Catalán, J. Herrero, E. S. Quintana‐Ortí, Rafael Rodríguez-Sánchez","doi":"10.1145/3026937.3026938","DOIUrl":"https://doi.org/10.1145/3026937.3026938","url":null,"abstract":"Asymmetric multicore processors (AMPs), as those present in ARM big.LITTLE technology, have been proposed as a means to address the end of Dennard power scaling law. The idea of these architectures is to activate only the type (and number) of cores that satisfy the quality of service requested by the application(s) in execution while delivering high energy efficiency. For dense linear algebra problems though, performance is of paramount importance, asking for an efficient use of all computational resources in the AMP. In response to this, we investigate how to exploit the asymmetric cores of an ARMv7 big.LITTLE AMP in order to attain high performance for the reduction to tridiagonal form, an essential step towards the solution of dense symmetric eigenvalue problems. The routine for this purpose in LAPACK is especially challenging, since half of its floating-point arithmetic operations (flops) are cast in terms of compute-bound kernels while the remaining half correspond to memory-bound kernels. 
To deal with this scenario: 1) we leverage a tuned implementation of the compute-bound kernels for AMPs; 2) we develop and parallelize new architecture-aware micro-kernels for the memory-bound kernels; 3) and we carefully adjust the type and number of cores to use at each step of the reduction procedure.","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132430026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Haidl, Michel Steuwer, H. Dirks, Tim Humernbrum, S. Gorlatch
In this paper, we advocate a composable approach to programming systems with Graphics Processing Units (GPUs): programs are developed as compositions of generic, reusable patterns. Current GPU programming approaches either rely on low-level, monolithic code without patterns (CUDA and OpenCL), which achieves high performance at the cost of cumbersome and error-prone programming, or they improve programmability by using pattern-based abstractions (e.g., Thrust) but pay a performance penalty due to inefficient implementations of pattern composition. We develop a C++ API for GPU programming with STL-style patterns, together with its compiler-based implementation. Our API gives application developers native C++ means (views and actions) to specify precisely which pattern compositions should be automatically fused during code generation into a single efficient GPU kernel, thereby ensuring high target performance. We implement our approach by extending the range-v3 library, which is currently being developed for the forthcoming C++ standards. The composable programming in our approach is done exclusively in standard C++14, with STL algorithms used as patterns which we re-implemented in parallel for GPUs. Our compiler implementation is based on the LLVM and Clang frameworks, and we use advanced multi-stage programming techniques for aggressive runtime optimizations. We experimentally evaluate our approach using a set of benchmark applications and a real-world case study from the area of image processing.
{"title":"Towards Composable GPU Programming: Programming GPUs with Eager Actions and Lazy Views","authors":"Michael Haidl, Michel Steuwer, H. Dirks, Tim Humernbrum, S. Gorlatch","doi":"10.1145/3026937.3026942","DOIUrl":"https://doi.org/10.1145/3026937.3026942","url":null,"abstract":"In this paper, we advocate a composable approach to programming systems with Graphics Processing Units (GPU): programs are developed as compositions of generic, reusable patterns. Current GPU programming approaches either rely on low-level, monolithic code without patterns (CUDA and OpenCL), which achieves high performance at the cost of cumbersome and error-prone programming, or they improve the programmability by using pattern-based abstractions (e.g., Thrust) but pay a performance penalty due to inefficient implementations of pattern composition. We develop an API for GPUs based programming on C++ with STL-style patterns and its compiler-based implementation. Our API gives the application developers the native C++ means (views and actions) to specify precisely which pattern compositions should be automatically fused during code generation into a single efficient GPU kernel, thereby ensuring a high target performance. We implement our approach by extending the range-v3 library which is currently being developed for the forthcoming C++ standards. The composable programming in our approach is done exclusively in the standard C++14, with STL algorithms used as patterns which we re-implemented in parallel for GPU. Our compiler implementation is based on the LLVM and Clang frameworks, and we use advanced multi-stage programming techniques for aggressive runtime optimizations. We experimentally evaluate our approach using a set of benchmark applications and a real-world case study from the area of image processing. 
Our codes achieve performance competitive with CUDA monolithic implementations, and we outperform pattern-based codes written using Nvidia's Thrust.","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"269 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123446599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Anzt, J. Dongarra, Goran Flegar, E. S. Quintana‐Ortí
In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
{"title":"Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs","authors":"H. Anzt, J. Dongarra, Goran Flegar, E. S. Quintana‐Ortí","doi":"10.1145/3026937.3026940","DOIUrl":"https://doi.org/10.1145/3026937.3026940","url":null,"abstract":"In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner. Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. 
Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129085314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many computing systems today are heterogeneous in that they consist of a mix of different types of processing units (e.g., CPUs, GPUs). Each of these processing units has different execution capabilities and energy consumption characteristics. Job mapping and scheduling play a crucial role in such systems, as they strongly affect the overall system performance, energy consumption, peak power and peak temperature. Allocating resources (e.g., core scaling, thread allocation) is another challenge, since different sets of resources exhibit different behavior in terms of performance and energy consumption. Many studies have been conducted on job scheduling with an eye on performance improvement. However, few of them take both performance and energy into account. We thus propose our novel Performance, Energy and Thermal aware Resource Allocator and Scheduler (PETRAS), which combines job mapping, core scaling, and thread allocation into one scheduler. Since job mapping and scheduling are known to be NP-hard problems, we apply an evolutionary algorithm, a Genetic Algorithm (GA), to find an efficient job schedule in terms of execution time and energy consumption, under peak power and peak temperature constraints. Experiments conducted on an actual system equipped with a multicore CPU and a GPU show that PETRAS finds efficient schedules in terms of execution time and energy consumption. Compared to a performance-based GA and other schedulers, the PETRAS scheduler can achieve a speedup of up to 4.7x and an energy saving of up to 195% on average.
Paper: PETRAS: Performance, Energy and Thermal Aware Resource Allocation and Scheduling for Heterogeneous Systems. Shouq Alsubaihi, J. Gaudiot. DOI: 10.1145/3026937.3026944.
G. Ceballos, Thomas Grass, Andra Hugo, D. Black-Schaffer
Recent scheduling heuristics for task-based applications have managed to improve performance by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provide insight into why, and where, different schedulers improve memory behavior, and how this relates to the applications' performance. To address this, we present TaskInsight, a technique to characterize the memory behavior of different task schedulers through the analysis of data reuse between tasks. TaskInsight provides high-level, quantitative information that can be correlated with tasks' performance variation over time to understand data reuse through the caches due to scheduling choices. TaskInsight is useful to diagnose and identify which scheduling decisions affected performance, when they were taken, and why the performance changed, both in single- and multi-threaded executions. We demonstrate how TaskInsight can diagnose examples where poor scheduling caused over 10% difference in performance for tasks of the same type, due to changes in the tasks' data reuse through the private and shared caches, in single- and multi-threaded executions of the same application. This flexible insight is key for optimization in many contexts, including data locality, throughput, memory footprint or even energy efficiency.
{"title":"TaskInsight: Understanding Task Schedules Effects on Memory and Performance","authors":"G. Ceballos, Thomas Grass, Andra Hugo, D. Black-Schaffer","doi":"10.1145/3026937.3026943","DOIUrl":"https://doi.org/10.1145/3026937.3026943","url":null,"abstract":"Recent scheduling heuristics for task-based applications have managed to improve their by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provide insights into why, and where, different schedulers improve memory behavior, and how this is related to the applications' performance. To address this, we present TaskInsight, a technique to characterize the memory behavior of different task schedulers through the analysis of data reuse between tasks. TaskInsight provides high-level, quantitative information that can be correlated with tasks' performance variation over time to understand data reuse through the caches due to scheduling choices. TaskInsight is useful to diagnose and identify which scheduling decisions affected performance, when were they taken, and why the performance changed, both in single and multi-threaded executions. We demonstrate how TaskInsight can diagnose examples where poor scheduling caused over 10% difference in performance for tasks of the same type, due to changes in the tasks' data reuse through the private and shared caches, in single and multi-threaded executions of the same application. 
This flexible insight is key for optimization in many contexts, including data locality, throughput, memory footprint or even energy efficiency.","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121338394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pingfan Li, Xuhao Chen, Jie Shen, Jianbin Fang, T. Tang, Canqun Yang
Detecting strongly connected components (SCCs) has been broadly used in many real-world applications. To speed up SCC detection for large-scale graphs, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations are able to achieve speedup on synthetic graph instances, but show limited performance when applied to large-scale real-world datasets. In this paper, we present a parallel SCC detection implementation on GPUs that achieves high performance on both synthetic and real-world graphs. We use a hybrid method that divides the algorithm into two phases. Our method is able to dynamically change parallelism strategies to maximize performance for each algorithm phase. We then orchestrate the graph traversal kernel with a customized strategy for each phase, and employ algorithm extensions to handle the serialization problem caused by irregular graph properties. Our design is carefully implemented to take advantage of the GPU hardware. Evaluation with diverse graphs on the NVIDIA K20c GPU shows that our proposed implementation achieves an average speedup of 5.0x over the serial Tarjan's algorithm. It also outperforms the existing OpenMP implementation with a speedup of 1.4x.
{"title":"High Performance Detection of Strongly Connected Components in Sparse Graphs on GPUs","authors":"Pingfan Li, Xuhao Chen, Jie Shen, Jianbin Fang, T. Tang, Canqun Yang","doi":"10.1145/3026937.3026941","DOIUrl":"https://doi.org/10.1145/3026937.3026941","url":null,"abstract":"Detecting strongly connected components (SCC) has been broadly used in many real-world applications. To speedup SCC detection for large-scale graphs, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations are able to get speedup on synthetic graph instances, but show limited performance when applied to large-scale real-world datasets. In this paper, we present a parallel SCC detection implementation on GPUs that achieves high performance on both synthetic and real-world graphs. We use a hybrid method that divides the algorithm into two phases. Our method is able to dynamically change parallelism strategies to maximize performance for each algorithm phase. We then orchestrates the graph traversal kernel with customized strategy for each phase, and employ algorithm extensions to handle the serialization problem caused by irregular graph properties. Our design is carefully implemented to take advantage of the GPU hardware. Evaluation with diverse graphs on the NVIDIA K20c GPU shows that our proposed implementation achieves an average speedup of 5.0x over the serial Tarjan's algorithm. 
It also outperforms the existing OpenMP implementation with a speedup of 1.4x.","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115371173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work establishes a scalable, easy-to-use and efficient approach for exploiting the SIMD capabilities of modern CPUs, without the need for extensive knowledge of architecture-specific instruction sets. We describe a new API, known as UME::SIMD, which provides a flexible, portable, type-oriented abstraction for SIMD instruction set architectures. Requirements for such libraries are analysed based on existing solutions as well as proposed future ones. A software architecture that achieves these requirements is explained, and its performance evaluated. Finally, we discuss how the API fits into the existing and future software ecosystem.
{"title":"A high-performance portable abstract interface for explicit SIMD vectorization","authors":"Przemyslaw Karpinski, John McDonald","doi":"10.1145/3026937.3026939","DOIUrl":"https://doi.org/10.1145/3026937.3026939","url":null,"abstract":"This work establishes a scalable, easy to use and efficient approach for exploiting SIMD capabilities of modern CPUs, without the need for extensive knowledge of architecture specific instruction sets. We provide a description of a new API, known as UME::SIMD, which provides a flexible, portable, type-oriented abstraction for SIMD instruction set architectures. Requirements for such libraries are analysed based on existing, as well as proposed future solutions. A software architecture that achieves these requirements is explained, and its performance evaluated. Finally we discuss how the API fits into the existing, and future software ecosystem.","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127746610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","authors":"","doi":"10.1145/3026937","DOIUrl":"https://doi.org/10.1145/3026937","url":null,"abstract":"","PeriodicalId":161677,"journal":{"name":"Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121716590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}