
2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW): Latest Papers

Dataflow Programming for Stream Processing
Marcos Paulo Rocha, F. França, A. S. Nery, Leandro S. Guedes
Stream processing applications have demanding performance requirements that are hard to meet using traditional parallel models on modern many-core architectures such as GPUs. Recent dataflow computing models, on the other hand, can naturally exploit parallelism for a wide class of applications. This work presents an extension to an existing dataflow library for Java. The extension implements high-level constructs with multiple command queues to overlap memory operations with kernel executions on GPUs. Experimental results show that significant speedup can be achieved for a subset of well-known stream processing applications: Volume Ray-Casting, Path-Tracing, and Sobel Filter.
DOI: https://doi.org/10.1109/SBAC-PADW.2017.26 | Published: 2017-10-01
Citations: 1
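Illustrative sketch (not from the paper): the abstract for "Dataflow Programming for Stream Processing" describes overlapping memory operations with kernel executions by using multiple command queues. The Java library's API is not shown in this listing, so the Python code below is only a CPU-side analogy: a transfer thread stages chunks in a bounded queue while the main thread runs the "kernel" on chunks that have already arrived. The names `pipelined_map`, `transfer`, `kernel`, and `depth` are hypothetical.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the stream


def pipelined_map(
    chunks: Iterable[Any],
    transfer: Callable[[Any], Any],
    kernel: Callable[[Any], Any],
    depth: int = 2,
) -> List[Any]:
    """Return [kernel(transfer(c)) for c in chunks], with transfers running ahead.

    Hypothetical analogy of two command queues: one thread "copies" data
    while the caller's thread "computes" on data that was copied earlier.
    """
    staged: queue.Queue = queue.Queue(maxsize=depth)  # bounded, like double buffering

    def transfer_worker() -> None:
        try:
            for chunk in chunks:
                staged.put(transfer(chunk))
        finally:
            staged.put(_DONE)  # always signal completion, even after an error

    worker = threading.Thread(target=transfer_worker, daemon=True)
    worker.start()

    results = []
    while True:
        item = staged.get()
        if item is _DONE:
            break
        results.append(kernel(item))  # runs while the next transfer is in flight
    worker.join()
    return results
```

For example, `pipelined_map([[1, 2], [3, 4]], list, sum)` copies each chunk with `list` and reduces it with `sum`. Results keep the input order because a single producer feeds a single consumer through a FIFO queue.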
Efficient Pathfinding Co-Processors for FPGAs
A. S. Nery, A. Sena, Leandro S. Guedes
Pathfinding algorithms are at the heart of several classes of applications, such as network appliances (routing), GPS navigation, and autonomous cars, which are related to recent trends in Artificial Intelligence and the Internet of Things (IoT). Moreover, advances in semiconductor miniaturization technologies have enabled the design of efficient System-on-Chip (SoC) devices with demanding performance requirements and energy consumption constraints. Such systems might include Field Programmable Gate Arrays (FPGAs) to allow the design of customized co-processors that yield lower power consumption and higher performance. Therefore, this work aims at designing and evaluating four efficient pathfinding co-processors, each one implementing a different well-known pathfinding algorithm: breadth-first, Dijkstra, greedy, and A*. Each co-processor is designed using the Xilinx High-Level Synthesis (HLS) compiler and is implemented in the programmable logic of a Xilinx FPGA with an embedded ARM microprocessor, which is in charge of controlling the set of co-processors. Extensive performance, circuit-area, and energy consumption results show that each co-processor can efficiently execute a pathfinding algorithm, paving the way for novel dedicated accelerators.
DOI: https://doi.org/10.1109/SBAC-PADW.2017.25 | Published: 2017-10-01
Citations: 4
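Illustrative sketch (not from the paper): the abstract for "Efficient Pathfinding Co-Processors for FPGAs" names four algorithms, namely breadth-first, Dijkstra, greedy, and A*. In software, the four can be viewed as one frontier search that differs only in how the frontier is ordered. The Python code below shows that view on a small grid; it says nothing about the HLS designs themselves. The grid encoding, the Manhattan heuristic, and the function name `find_path` are assumptions made for this example.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def find_path(grid: List[str], start: Cell, goal: Cell, strategy: str) -> Optional[List[Cell]]:
    """Search a grid of '.' (free) and '#' (wall) cells with unit step cost.

    strategy is one of 'bfs', 'dijkstra', 'greedy', 'astar' (hypothetical names).
    Returns the list of cells from start to goal, or None if unreachable.
    """
    def heuristic(cell: Cell) -> int:
        # Manhattan distance: never overestimates on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    def priority(cost: int, cell: Cell, order: int) -> int:
        if strategy == "bfs":
            return order                   # first in, first out
        if strategy == "dijkstra":
            return cost                    # cheapest path so far
        if strategy == "greedy":
            return heuristic(cell)         # closest-looking cell first
        if strategy == "astar":
            return cost + heuristic(cell)  # path cost plus estimate
        raise ValueError(f"unknown strategy: {strategy}")

    counter = 0  # insertion order, also used as a tie-breaker in the heap
    frontier = [(priority(0, start, counter), counter, start)]
    cost_so_far: Dict[Cell, int] = {start: 0}
    came_from: Dict[Cell, Optional[Cell]] = {start: None}

    while frontier:
        _, _, current = heapq.heappop(frontier)
        if current == goal:
            break
        row, col = current
        for r, c in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            if not (0 <= r < len(grid) and 0 <= c < len(grid[r])) or grid[r][c] == "#":
                continue
            new_cost = cost_so_far[current] + 1
            if (r, c) not in cost_so_far or new_cost < cost_so_far[(r, c)]:
                cost_so_far[(r, c)] = new_cost
                came_from[(r, c)] = current
                counter += 1
                heapq.heappush(frontier, (priority(new_cost, (r, c), counter), counter, (r, c)))

    if goal not in came_from:
        return None
    path: List[Cell] = []
    node: Optional[Cell] = goal
    while node is not None:  # walk parent links back to the start
        path.append(node)
        node = came_from[node]
    return path[::-1]
```

With unit step costs, the breadth-first, Dijkstra, and A* orderings all return a shortest path; the greedy ordering returns a valid path that is not guaranteed to be shortest.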
Automatic Scan Parallelization in OpenMP
Maicol Zegarra, M. Pereira, X. Martorell, G. Araújo
Prefix Scan (or simply scan) is an operator that computes all the partial sums of a vector. A scan operation results in a vector where each element is the sum of the preceding elements in the original vector up to the corresponding position. Scan is a key operation in many relevant problems such as sorting, lexical analysis, string comparison, and image filtering, among others. Although there are libraries that provide hand-parallelized implementations of scan in CUDA and OpenCL, no automatic parallelization solution exists for this operator in OpenMP. This paper proposes a new clause for OpenMP that enables the automatic synthesis of the parallel scan. By using the proposed clause, a programmer can considerably reduce the complexity of designing scan-based algorithms and can therefore focus on the problem rather than on learning new parallel programming models or languages. Scan was designed in AClang, an open-source LLVM/Clang compiler framework that implements the recently released OpenMP 4.X Accelerator Programming Model. Experiments running a set of typical scan-based algorithms on NVIDIA, Intel, and ARM GPUs reveal that the performance of the proposed OpenMP clause is equivalent to that achieved when using OpenCL library calls, with the advantage of lower programming complexity.
DOI: https://doi.org/10.1109/SBAC-PADW.2017.23 | Published: 2017-10-01
Citations: 1
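Illustrative sketch (not from the paper): the abstract for "Automatic Scan Parallelization in OpenMP" defines scan as computing all partial sums of a vector. The Python code below shows a sequential reference and a step-synchronous formulation (commonly attributed to Hillis and Steele) in which every element of a pass can be updated independently. It assumes the inclusive reading of the definition, where position i includes element i. It is not the proposed OpenMP clause and not the code AClang generates; the function names are hypothetical.

```python
from typing import List


def inclusive_scan(values: List[int]) -> List[int]:
    """Sequential reference: out[i] = values[0] + ... + values[i]."""
    out = []
    running = 0
    for v in values:
        running += v
        out.append(running)
    return out


def hillis_steele_scan(values: List[int]) -> List[int]:
    """Step-synchronous scan: about log2(n) passes over the data.

    Each pass reads only the previous pass's array, so all element updates
    within a pass are independent. Here the passes simply run serially.
    """
    data = list(values)
    offset = 1
    while offset < len(data):
        data = [
            data[i] + data[i - offset] if i >= offset else data[i]
            for i in range(len(data))
        ]
        offset *= 2
    return data
```

For the input `[3, 1, 4, 1, 5]`, both functions return `[3, 4, 8, 9, 14]`. The second form does more additions in total but has a dependency depth that grows with log2(n) rather than n, which is what makes a parallel scan possible.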
A Case Study of Performance Optimization in a Heterogeneous Environment
Leandro Pereira, C. Bentes, Maria Clicia Stelling de Castro, E. Garcia
Optimizing legacy codes to fully exploit the parallelism opportunities provided by modern heterogeneous architectures is a difficult task. Multiple levels of parallelism can be exploited in order to gain the expected performance. This work describes the lessons learned in the performance optimization of a real-world reservoir engineering application composed of thousands of lines of code. We study the exploitation of the multiple levels of parallelism, showing a possible, although non-trivial, path to extract performance. Our results show that exploiting thread-level parallelism is not always the best path to performance gains. On the other hand, vectorization plays a key role in reducing the execution time of the application.
DOI: https://doi.org/10.1109/SBAC-PADW.2017.11 | Published: 2017-10-01
Citations: 0
HPSM: A Programming Framework for Multi-CPU and Multi-GPU Systems
J. F. Lima, D. D. Domenico
This paper presents HPSM, a high-level C++ framework for exploiting multi-CPU and multi-GPU systems. HPSM enables parallel loops and reductions implemented over three parallel backends: Serial, OpenMP (with GCC and libKOMP runtime), and StarPU. We evaluated HPSM development effort with an AXPY program, and performance with three parallel benchmarks: N-Body, Hotspot, and a CFD solver. The CPU-GPU combination attained better performance than GPUs alone for all cases on a CPU-GPU system. Still, our findings provide evidence that NUMA affinity at the framework level may produce different results.
DOI: https://doi.org/10.1109/SBAC-PADW.2017.14 | Published: 2017-10-01
Citations: 3
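Illustrative sketch (not from the paper): the HPSM abstract mentions parallel loops over interchangeable backends and an AXPY program (y = a*x + y). HPSM is a C++ framework and its real API is not shown in this listing, so the Python code below only mimics the idea of one loop body dispatched to a selectable backend. The names `parallel_for`, `axpy`, and the backend labels "serial" and "threads" are hypothetical, and Python threads illustrate the structure rather than real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parallel_for(
    n: int,
    body: Callable[[int, int], None],
    backend: str = "serial",
    workers: int = 4,
) -> None:
    """Run body(begin, end) over the index range [0, n).

    Hypothetical front end: 'serial' runs one chunk, 'threads' splits the
    range into contiguous chunks handled by a thread pool.
    """
    if backend == "serial":
        body(0, n)
        return
    if backend != "threads":
        raise ValueError(f"unknown backend: {backend}")
    chunk = max(1, (n + workers - 1) // workers)
    ranges = [(begin, min(begin + chunk, n)) for begin in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list(...) waits for every chunk and re-raises any worker exception.
        list(pool.map(lambda r: body(r[0], r[1]), ranges))


def axpy(a: float, x: List[float], y: List[float], backend: str = "serial") -> None:
    """In-place y = a * x + y; chunks touch disjoint indices, so no locking is needed."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    def body(begin: int, end: int) -> None:
        for i in range(begin, end):
            y[i] = a * x[i] + y[i]

    parallel_for(len(x), body, backend)
```

The loop body is written once and the backend is chosen by a single argument, which is the kind of portability the abstract attributes to the framework.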
A Communication Protocol for Fog Computing Based on Network Coding Applied to Wireless Sensors
B. Marques, I. M. Coelho, A. Sena, M. D. Castro
A communication protocol for fog computing should be efficient, lightweight, and customizable. In this work we focus on a communication protocol for fog nodes composed of wireless sensors, which are spatially distributed autonomous sensors monitoring physical or environmental conditions. Problems with data congestion and limited physical resources are common in these networks. To optimize data flow, it is important to apply techniques that reduce the amount of transmitted data. We use the network coding technique to demonstrate through experiments the degree of efficiency of data transmission optimization protocols. The experiments were performed with a wireless sensor programming framework composed of the TinyOS operating system, the NesC programming language, and the TOSSIM simulator. In addition, we use the Python programming language to simulate the wireless sensor network topology. The results demonstrate better performance (50% to 60%) when the network coding technique is applied to the data communication protocol.
DOI: https://doi.org/10.1109/SBAC-PADW.2017.27 | Published: 2017-10-01
Citations: 11
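Illustrative sketch (not from the paper): the abstract for the fog computing protocol says network coding is used to reduce transmitted data but does not state which coding scheme. The Python code below assumes the simplest textbook case, XOR coding at a relay between two nodes: the relay broadcasts one combined packet instead of forwarding two, and each node recovers the other's packet by XOR-ing with its own. The function names and the example payloads are hypothetical, and the transmission count applies to this toy topology only.

```python
from typing import Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length packets."""
    if len(a) != len(b):
        raise ValueError("packets must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def relay_exchange(packet_a: bytes, packet_b: bytes) -> Tuple[bytes, bytes, int]:
    """Exchange two packets through a relay using XOR coding.

    Returns (what node A receives, what node B receives, transmissions used).
    Plain forwarding would need 4 transmissions: A->R, B->R, R->A, R->B.
    """
    transmissions = 2                        # A -> relay and B -> relay
    coded = xor_bytes(packet_a, packet_b)    # relay combines both packets
    transmissions += 1                       # one broadcast reaches A and B
    received_by_a = xor_bytes(coded, packet_a)  # A cancels its own packet
    received_by_b = xor_bytes(coded, packet_b)  # B cancels its own packet
    return received_by_a, received_by_b, transmissions
```

For example, `relay_exchange(b"temp=21", b"hum=043")` delivers each payload to the opposite node in 3 transmissions. Decoding works because XOR is its own inverse: (a XOR b) XOR a equals b.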
Assessing Sparse Triangular Linear System Solvers on GPUs
Daniel Erguiz, Ernesto Dufrechu, P. Ezzatti
A significant number of Numerical Linear Algebra methods used to tackle problems in diverse fields of science and engineering rely heavily on the solution of one or many sparse triangular linear systems. Since the early years, this has motivated numerous efforts that seek to produce efficient implementations of this kernel for most hardware platforms. However, this operation implies strong data dependencies and unbalanced computations that hinder concurrency, especially when massively parallel processors such as GPUs are employed. In this work we review the different techniques to expose the data parallelism in this operation, with special attention to the many-core based proposals. Additionally, we experimentally evaluate the two most successful approaches, namely the routine that is included in the CUSPARSE library and the synchronization-free method of W. Liu et al. [1]. Finally, we make progress on characterizing triangular sparse linear systems in order to select the best solver in each case.
DOI: https://doi.org/10.1109/SBAC-PADW.2017.15 | Published: 2017-10-01
Citations: 14
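Illustrative sketch (not from the paper): the abstract for "Assessing Sparse Triangular Linear System Solvers on GPUs" points to the strong data dependencies of the sparse triangular solve. The Python code below shows where they come from, using forward substitution on a lower triangular matrix in CSR format, together with a level computation that groups rows with no mutual dependencies. Level scheduling is one commonly described way to expose parallelism; the abstract does not say that it matches the routines evaluated in the paper. The function names and the CSR argument names are assumptions.

```python
from typing import List


def solve_lower_csr(
    row_ptr: List[int], col_idx: List[int], vals: List[float], b: List[float]
) -> List[float]:
    """Solve L x = b by forward substitution, with L lower triangular in CSR."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]
            elif j < i:
                acc -= vals[k] * x[j]  # row i needs x[j] from an earlier row
            else:
                raise ValueError("matrix is not lower triangular")
        if diag is None or diag == 0:
            raise ValueError(f"missing or zero diagonal in row {i}")
        x[i] = acc / diag
    return x


def dependency_levels(row_ptr: List[int], col_idx: List[int]) -> List[int]:
    """level[i] = 0 if row i depends on no earlier row, else 1 + max level of its dependencies.

    Rows that share a level do not depend on each other, so they could be
    solved concurrently once all lower levels are finished.
    """
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        deps = [
            level[col_idx[k]]
            for k in range(row_ptr[i], row_ptr[i + 1])
            if col_idx[k] < i
        ]
        level[i] = 1 + max(deps) if deps else 0
    return level
```

The number of distinct levels bounds how many sequential phases a level-scheduled solver needs, and the number of rows per level indicates how unbalanced the work can be.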
Automatic Partitioning of Stencil Computations on Heterogeneous Systems
Alyson D. Pereira, Rodrigo C. O. Rocha, Luiz E. Ramos, M. Castro, L. F. Góes
The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on GPUs. However, most of the runtime systems that execute those applications fail to fully utilize the parallelism of modern heterogeneous systems. In this paper, we propose a mechanism based on machine learning that automatically partitions stencil computations across CPU and GPU. We implemented it in the PSkel framework and found that the mechanism can boost the performance of stencil applications on average by 17.9x compared to their sequential CPU-only counterparts, by 1.34x compared to a GPU-only version, and by 1.48x compared to a parallel CPU-only version.
DOI: https://doi.org/10.1109/SBAC-PADW.2017.16 | Published: 2017-10-01
Citations: 4
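Illustrative sketch (not from the paper): the abstract for "Automatic Partitioning of Stencil Computations on Heterogeneous Systems" describes splitting a stencil computation between CPU and GPU. The Python code below shows what such a split means for a one-dimensional 3-point stencil: the index range is cut at a chosen fraction, each part is computed separately while reading one halo cell from its neighbor's region, and the parts are concatenated. In the paper the partition is chosen by a machine-learning mechanism inside PSkel; here `gpu_fraction` is simply a parameter, both parts run on the CPU, and all names are hypothetical.

```python
from typing import List


def stencil_step(data: List[int], begin: int, end: int) -> List[int]:
    """3-point sum stencil over [begin, end).

    Neighbors outside the array count as 0. Neighbors outside [begin, end)
    but inside the array are halo cells read from the other partition.
    """
    n = len(data)
    out = []
    for i in range(begin, end):
        left = data[i - 1] if i > 0 else 0
        right = data[i + 1] if i + 1 < n else 0
        out.append(left + data[i] + right)
    return out


def partitioned_step(data: List[int], gpu_fraction: float) -> List[int]:
    """Compute one stencil step with the index range split in two parts."""
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    split = int(len(data) * gpu_fraction)
    gpu_part = stencil_step(data, 0, split)          # would be offloaded to the GPU
    cpu_part = stencil_step(data, split, len(data))  # would stay on the CPU
    return gpu_part + cpu_part
```

Any value of `gpu_fraction` gives the same result as an unpartitioned step, for example `[3, 6, 9, 12, 9]` for the input `[1, 2, 3, 4, 5]`. Only the amount of work assigned to each side changes, and that balance is what an automatic partitioner would tune.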