Nested thread-level parallelism (TLP) is pervasive in real applications. For example, 75% (14 out of 19) of the applications in the Rodinia benchmark suite for heterogeneous accelerators contain kernels with nested thread-level parallelism. Efficiently mapping this enclosed nested parallelism to GPU threads during C-to-CUDA compilation (OpenACC in this paper) is becoming increasingly important. The mapping problem is twofold: choosing a suitable execution model and devising efficient mapping strategies for the nested parallelism.
{"title":"An Efficient Vectorization Approach to Nested Thread-level Parallelism for CUDA GPUs","authors":"Shixiong Xu, David Gregg","doi":"10.1109/PACT.2015.56","DOIUrl":"https://doi.org/10.1109/PACT.2015.56","url":null,"abstract":"Nested thread-level parallelism (TLP) is pervasive in real applications. For example, 75% (14 out of 19) of the applications in the Rodinia benchmark for heterogeneous accelerators contain kernels with nested thread-level parallelism. Efficiently mapping the enclosed nested parallelism to the GPU threads in the C-to-CUDA compilation (OpenACC in this paper) is becoming more and more important. This mapping problem is two folds: suitable execution models and efficient mapping strategies of the nested parallelism.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123292256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arnamoy Bhattacharyya, Grzegorz Kwasniewski, T. Hoefler
Performance modeling can be utilized in a number of scenarios, ranging from finding performance bugs to studying the scalability of applications. Existing dynamic and static approaches for automating the generation of performance models have limitations in precision and overhead. In this work, we explore combinations of static and dynamic analyses for life-long performance modeling and investigate accuracy, reduction of the model search space, and performance improvements over previous approaches on a wide range of parallel benchmarks. We develop static and dynamic schemes such as kernel clustering, batched model updates, and regulation of modeling frequency to reduce the cost of measurements, model generation, and updates. On average, our hybrid approach improves the accuracy of the performance models by 4.3% (maximum 10%) and reduces the overhead by 25% (maximum 65%) compared to previous approaches.
{"title":"Using Compiler Techniques to Improve Automatic Performance Modeling","authors":"Arnamoy Bhattacharyya, Grzegorz Kwasniewski, T. Hoefler","doi":"10.1109/PACT.2015.39","DOIUrl":"https://doi.org/10.1109/PACT.2015.39","url":null,"abstract":"Performance modeling can be utilized in a number of scenarios, starting from finding performance bugs to the scalability study of applications. Existing dynamic and static approaches for automating the generation of performance models have limitations for precision and overhead. In this work, we explore combination of a number of static and dynamic analyses for life-long performance modeling and investigate accuracy, reduction of the model search space, and performance improvements over previous approaches on a wide range of parallel benchmarks. We develop static and dynamic schemes such as kernel clustering, batched model updates and regulation of modeling frequency for reducing the cost of measurements, model generation, and updates. Our hybrid approach, on average can improve the accuracy of the performance models by 4.3%(maximum 10%) and can reduce the overhead by 25% (maximum 65%) as compared to previous approaches.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129129999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, G. Loh, C. Das, M. Kandemir, O. Mutlu
In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current instruction complete. In this work, we make three new observations. First, GPGPU warps exhibit heterogeneous memory divergence behavior at the shared cache: some warps have most of their requests hit in the cache (high cache utility), while other warps see most of their requests miss (low cache utility). Second, a warp retains the same divergence behavior for long periods of execution. Third, due to high memory-level parallelism, requests going to the shared cache can incur queuing delays as large as hundreds of cycles, exacerbating the effects of memory divergence. We propose a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing. MeDiC uses warp divergence characterization to guide three components: (1) a cache bypassing mechanism that exploits the latency tolerance of low-cache-utility warps to both alleviate queuing delay and increase the hit rate for high-cache-utility warps, (2) a cache insertion policy that prevents data from high-cache-utility warps from being prematurely evicted, and (3) a memory controller that prioritizes the few requests received from high-cache-utility warps to minimize stall time. We compare MeDiC to four cache management techniques, and find that it delivers an average speedup of 21.8%, and 20.1% higher energy efficiency, over a state-of-the-art GPU cache management mechanism across 15 different GPGPU applications.
{"title":"Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance","authors":"Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, G. Loh, C. Das, M. Kandemir, O. Mutlu","doi":"10.1109/PACT.2015.38","DOIUrl":"https://doi.org/10.1109/PACT.2015.38","url":null,"abstract":"In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current instruction complete. In this work, we make three new observations. First, GPGPU warps exhibit heterogeneous memory divergence behavior at the shared cache: some warps have most of their requests hit in the cache (high cache utility), while other warps see most of their request miss (low cache utility). Second, a warp retains the same divergence behavior for long periods of execution. Third, due to high memory level parallelism, requests going to the shared cache can incur queuing delays as large as hundreds of cycles, exacerbating the effects of memory divergence. We propose a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing. MeDiC uses warp divergence characterization to guide three components: (1) a cache bypassing mechanism that exploits the latency tolerance of low cache utility warps to both alleviate queuing delay and increase the hit rate for high cache utility warps, (2) a cache insertion policy that prevents data from highcache utility warps from being prematurely evicted, and (3) a memory controller that prioritizes the few requests received from high cache utility warps to minimize stall time. We compare MeDiC to four cache management techniques, and find that it delivers an average speedup of 21.8%, and 20.1% higher energy efficiency, over a state-of-the-art GPU cache management mechanism across 15 different GPGPU applications.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129945695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmad Hassan, H. Vandierendonck, Dimitrios S. Nikolopoulos
DRAM consumes significant static energy in both active and idle states due to continuous leakage and refresh power. Various byte-addressable non-volatile memory (NVM) technologies promise near-zero static energy and persistence; however, they suffer from higher latency and higher dynamic energy than DRAM. A hybrid main memory, containing both DRAM and NVM components, can provide both low energy and high performance, although such organizations require that data be placed in the appropriate component. We propose a user-level software management methodology for a hybrid DRAM/NVM main memory system with the aim of reducing energy.
{"title":"Energy-Efficient Hybrid DRAM/NVM Main Memory","authors":"Ahmad Hassan, H. Vandierendonck, Dimitrios S. Nikolopoulos","doi":"10.1109/PACT.2015.58","DOIUrl":"https://doi.org/10.1109/PACT.2015.58","url":null,"abstract":"DRAM consumes significant static energy both in active and idle state due to continuous leakage and refresh power. Various byte-addressable non-volatile memory (NVM) technologies promise near-zero static energy and persistence, however they suffer from increased latency and increased dynamic energy than DRAM. A hybrid main memory, containing both DRAM and NVM components, can provide both low energy and high performance although such organizations require that data is placed in the appropriate component. We propose a user-level software management methodology for a hybrid DRAM/NVM main memory system with an aim to reduce energy.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132851352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The vast computing power of GPUs makes them an attractive platform for accelerating large-scale data-parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of real-world power-law graphs make effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning an appropriate number of SIMD threads to process each vertex's irregularly sized neighbor list, while employing the compact CSR representation to maximize the graph size that can be kept inside GPU global memory. Prior works can maximize either the graph size (VWC uses the CSR representation) or device utilization (e.g., CuSha uses the CW representation; however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to multiple GPUs and propose Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus. Vertex Refinement employs a parallel binary prefix sum to dynamically collect only the updated boundary vertices into the GPUs' outbox buffers, dramatically reducing inter-GPU data transfer volume; in contrast, existing multi-GPU techniques (Medusa, TOTEM) perform a high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to the inter-GPU vertex communication schemes used by other multi-GPU techniques (i.e., Medusa, TOTEM).
{"title":"Scalable SIMD-Efficient Graph Processing on GPUs","authors":"Farzad Khorasani, Rajiv Gupta, L. Bhuyan","doi":"10.1109/PACT.2015.15","DOIUrl":"https://doi.org/10.1109/PACT.2015.15","url":null,"abstract":"The vast computing power of GPUs makes them an attractive platform for accelerating large scale data parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of real-world power law graphs makes effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning appropriate number of SIMD threads to process a vertex with irregular-sized neighbors while employing compact CSR representation to maximize the graph size that can be kept inside the GPU global memory. Prior works can either maximize graph sizes (VWC uses the CSR representation) or device utilization (e.g., CuSha uses the CW representation, however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to make use of multiple GPUs while proposing Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus. Vertex refinement employs parallel binary prefix sum to dynamically collect only the updated boundary vertices inside GPUs' outbox buffers for dramatically reducing inter-GPU data transfer volume. Whereas existing multi-GPU techniques (Medusa, TOTEM) perform high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to inter-GPU vertex communication schemes used by other multi-GPU techniques (i.e., Medusa, TOTEM).","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115305645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prasanna Venkatesh Rengasamy, A. Sivasubramaniam, M. Kandemir, C. Das
Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the increasingly important paradigm of approximate computing. Many applications are either tolerant of slight errors in the output or, if stringent, have built-in resiliency that tolerates some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance of slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in the d-L1 offers only limited scope for improvement, since those lines get evicted fairly soon due to the high pressure on the d-L1. Instead, we propose a very small (8-line) Stale Victim Cache (SVC) to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time out these lines from the SVC to limit their staleness, in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup of 10-15% across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on correctness, allowing all applications to complete. Because of inherent application resilience, eleven applications showed no errors at all, and the maximum error was at most 0.08% across the entire suite.
{"title":"Exploiting Staleness for Approximating Loads on CMPs","authors":"Prasanna Venkatesh Rengasamy, A. Sivasubramaniam, M. Kandemir, C. Das","doi":"10.1109/PACT.2015.27","DOIUrl":"https://doi.org/10.1109/PACT.2015.27","url":null,"abstract":"Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the growingly important paradigm of approximate computing. Many applications are either tolerant to slight errors in the output or if stringent, have in-built resiliency to tolerate some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in d-L1 offers only limited scope for improvement since those lines get evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8 lines) Stale Victim Cache (SVC), to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time-out these lines from the SVC to limit their staleness in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup between 10-15% across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on the correctness, allowing all of them to complete. There were no errors, because of inherent application resilience, in eleven applications, and the maximum error was at most 0.08% across the entire suite.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127471037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Ishizaki, Akihiro Hayashi, Gita Koblents, Vivek Sarkar
GPUs can enable significant performance improvements for certain classes of data parallel applications and are widely used in recent computer systems. However, GPU execution currently requires explicit low-level operations such as 1) managing memory allocations and transfers between the host system and the GPU, 2) writing GPU kernels in a low-level programming model such as CUDA or OpenCL, and 3) optimizing the kernels by utilizing appropriate memory types on the GPU. Because of this complexity, in many cases, only expert programmers can exploit the computational capabilities of GPUs through the CUDA/OpenCL languages. This is unfortunate since a large number of programmers use high-level languages, such as Java, due to their advantages of productivity, safety, and platform portability, but would still like to exploit the performance benefits of GPUs. Thus, one challenging problem is how to utilize GPUs while allowing programmers to continue to benefit from the productivity advantages of languages like Java. This paper presents a just-in-time (JIT) compiler that can generate and optimize GPU code from a pure Java program written using lambda expressions with the new parallel streams APIs in Java 8. These APIs allow Java programmers to express data parallelism at a higher level than threads and tasks. Our approach translates lambda expressions with parallel streams APIs in Java 8 into GPU code and automatically generates runtime calls that handle the low-level operations mentioned above. Additionally, our optimization techniques 1) allocate and align the starting address of the Java array body in the GPUs with the memory transaction boundary to increase memory bandwidth, 2) utilize read-only cache for array accesses to increase memory efficiency in GPUs, and 3) eliminate redundant data transfer between the host and the GPU. The compiler also performs loop versioning for eliminating redundant exception checks and for supporting virtual method invocations within GPU kernels. These features and optimizations are supported and automatically performed by a JIT compiler that is built on top of a production version of the IBM Java 8 runtime environment. Our experimental results on an NVIDIA Tesla GPU show significant performance improvements over sequential execution (127.9 × geometric mean) and parallel execution (3.3 × geometric mean) for eight Java 8 benchmark programs running on a 160-thread POWER8 machine. This paper also includes an in-depth analysis of GPU execution to show the impact of our optimization techniques by selectively disabling each optimization. Our experimental results show a geometric-mean speed-up of 1.15 × in the GPU kernel over state-of-the-art approaches. Overall, our JIT compiler can improve the performance of Java 8 programs by automatically leveraging the computational capability of GPUs.
{"title":"Compiling and Optimizing Java 8 Programs for GPU Execution","authors":"K. Ishizaki, Akihiro Hayashi, Gita Koblents, Vivek Sarkar","doi":"10.1109/PACT.2015.46","DOIUrl":"https://doi.org/10.1109/PACT.2015.46","url":null,"abstract":"GPUs can enable significant performance improvements for certain classes of data parallel applications and are widely used in recent computer systems. However, GPU execution currently requires explicit low-level operations such as 1) managing memory allocations and transfers between the host system and the GPU, 2) writing GPU kernels in a low-level programming model such as CUDA or OpenCL, and 3) optimizing the kernels by utilizing appropriate memory types on the GPU. Because of this complexity, in many cases, only expert programmers can exploit the computational capabilities of GPUs through the CUDA/OpenCL languages. This is unfortunate since a large number of programmers use high-level languages, such as Java, due to their advantages of productivity, safety, and platform portability, but would still like to exploit the performance benefits of GPUs. Thus, one challenging problem is how to utilize GPUs while allowing programmers to continue to benefit from the productivity advantages of languages like Java. This paper presents a just-in-time (JIT) compiler that can generate and optimize GPU code from a pure Java program written using lambda expressions with the new parallel streams APIs in Java 8. These APIs allow Java programmers to express data parallelism at a higher level than threads and tasks. Our approach translates lambda expressions with parallel streams APIs in Java 8 into GPU code and automatically generates runtime calls that handle the low-level operations mentioned above. Additionally, our optimization techniques 1) allocate and align the starting address of the Java array body in the GPUs with the memory transaction boundary to increase memory bandwidth, 2) utilize read-only cache for array accesses to increase memory efficiency in GPUs, and 3) eliminate redundant data transfer between the host and the GPU. The compiler also performs loop versioning for eliminating redundant exception checks and for supporting virtual method invocations within GPU kernels. These features and optimizations are supported and automatically performed by a JIT compiler that is built on top of a production version of the IBM Java 8 runtime environment. Our experimental results on an NVIDIA Tesla GPU show significant performance improvements over sequential execution (127.9 × geometric mean) and parallel execution (3.3 × geometric mean) for eight Java 8 benchmark programs running on a 160-thread POWER8 machine. This paper also includes an in-depth analysis of GPU execution to show the impact of our optimization techniques by selectively disabling each optimization. Our experimental results show a geometric-mean speed-up of 1.15 × in the GPU kernel over state-of-the-art approaches. 
Overall, our JIT compiler can improve the performance of Java 8 programs by automatically leveraging the computational capability of GPUs.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130684407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
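To give a sense of what such a JIT does, the kernel below is a hand-written approximation of the GPU code one could generate for a Java 8 parallel stream such as `IntStream.range(0, n).parallel().forEach(i -> a[i] = b[i] * b[i] + c[i])`. The code actually produced by the IBM Java 8 JIT, its loop versioning for exception checks, and its automatic data-transfer handling are not shown; all names here are illustrative.

```cuda
// Hedged, hand-written approximation of GPU code for a Java 8 parallel stream body:
//   IntStream.range(0, n).parallel().forEach(i -> a[i] = b[i] * b[i] + c[i]);
// The real JIT also versions the loop to hoist bounds/null checks and manages
// host<->device transfers automatically; none of that is modeled here.
__global__ void stream_foreach(float *a, const float *b, const float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one lambda invocation per thread
    if (i < n)                                       // corresponds to the range bound
        a[i] = b[i] * b[i] + c[i];
}

// Illustrative launch the generated runtime call might issue after copying the arrays:
//   stream_foreach<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```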
Traditionally, programmers and software tools have focused on mapping a single data-parallel kernel onto a heterogeneous computing system consisting of multiple general-purpose processors (CPUs) and graphics processing units (GPUs). These methodologies break down as application complexity grows to contain multiple communicating data-parallel kernels. This paper introduces MKMD, an automatic system for mapping multiple kernels across multiple computing devices in a seamless manner. MKMD is a two-phase approach that combines coarse-grained scheduling of indivisible kernels with opportunistic fine-grained workgroup-level partitioning to exploit idle resources. During this process, MKMD considers kernel dependencies and the underlying system, along with an execution-time model built from a few sets of profile data. Based on the scheduling decisions, MKMD transparently manages the order of execution and data transfers for each device. On a real machine with one CPU and two different GPUs, MKMD achieves a mean speedup of 1.89x compared to in-order execution on the fastest device for a set of applications with multiple kernels. 53% of this speedup comes from the coarse-grained scheduling and the other 47% is the result of the fine-grained partitioning.
{"title":"Orchestrating Multiple Data-Parallel Kernels on Multiple Devices","authors":"Janghaeng Lee, M. Samadi, S. Mahlke","doi":"10.1109/PACT.2015.14","DOIUrl":"https://doi.org/10.1109/PACT.2015.14","url":null,"abstract":"Traditionally, programmers and software tools have focused on mapping a single data-parallel kernel onto a heterogeneous computing system consisting of multiple general-purpose processors (CPUS) and graphics processing units (GPUs). These methodologies break down as application complexity grows to contain multiple communicating data-parallel kernels. This paper introduces MKMD, an automatic system for mapping multiple kernels across multiple computing devices in a seamless manner. MKMD is a two phased approach that combines coarse grain scheduling of indivisible kernels followed by opportunistic fine-grained workgroup-level partitioning to exploit idle resources. During this process, MKMD considers kernel dependencies and the underlying systems along with the execution time model built with a few sets of profile data. With the scheduling decision, MKMD transparently manages the order of executions and data transfers for each device. On a real machine with one CPU and two different GPUs, MKMD achieves a mean speedup of 1.89x compared to the in-order execution on the fastest device for a set of applications with multiple kernels. 53% of this speedup comes from the coarse-grained scheduling and the other 47% is the result of the fine-grained partitioning.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128127632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michelle L. Goodstein, Phillip B. Gibbons, M. Kozuch, T. Mowry
Dataflow analysis-based dynamic parallel monitoring (DADPM) is a recent approach for identifying bugs in parallel software as it executes, based on the key insight of explicitly modeling a sliding window of uncertainty across parallel threads. While this makes the approach practical and scalable, it also introduces the possibility of false positives in the analysis. In this paper, we improve upon the DADPM framework through two observations. First, by explicitly tracking new “uncertain” states in the metadata lattice, we can distinguish potential false positives from true positives. Second, as the analysis tool runs dynamically, it can use the existence (or absence) of observed uncertain states to adjust the tradeoff between precision and performance on-the-fly. For example, we demonstrate how the epoch size parameter can be adjusted dynamically in response to uncertainty in order to achieve better performance and precision than when the tool is statically configured. This paper shows how to adapt a canonical dataflow analysis problem (reaching definitions) and a popular security monitoring tool (TAINTCHECK) to our new uncertainty-tracking framework, and provides new provable guarantees that reported true errors are now precise.
{"title":"Tracking and Reducing Uncertainty in Dataflow Analysis-Based Dynamic Parallel Monitoring","authors":"Michelle L. Goodstein, Phillip B. Gibbons, M. Kozuch, T. Mowry","doi":"10.1109/PACT.2015.20","DOIUrl":"https://doi.org/10.1109/PACT.2015.20","url":null,"abstract":"Dataflow analysis-based dynamic parallel monitoring (DADPM) is a recent approach for identifying bugs in parallel software as it executes, based on the key insight of explicitly modeling a sliding window of uncertainty across parallel threads. While this makes the approach practical and scalable, it also introduces the possibility of false positives in the analysis. In this paper, we improve upon the DADPM framework through two observations. First, by explicitly tracking new “uncertain” states in the metadata lattice, we can distinguish potential false positives from true positives. Second, as the analysis tool runs dynamically, it can use the existence (or absence) of observed uncertain states to adjust the tradeoff between precision and performance on-the-fly. For example, we demonstrate how the epoch size parameter can be adjusted dynamically in response to uncertainty in order to achieve better performance and precision than when the tool is statically configured. This paper shows how to adapt a canonical dataflow analysis problem (reaching definitions) and a popular security monitoring tool (TAINTCHECK) to our new uncertainty-tracking framework, and provides new provable guarantees that reported true errors are now precise.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132452067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harshvardhan, Adam Fidel, N. Amato, Lawrence Rauchwerger
Graph algorithms on distributed-memory systems typically perform heavy communication, often limiting their scalability and performance. This work presents an approach to transparently (without programmer intervention) allow fine-grained graph algorithms to utilize algorithmic communication reduction optimizations. In many graph algorithms, the same information is communicated by a vertex to its neighbors, which we term algorithmic redundancy. Our approach exploits algorithmic redundancy to reduce communication between vertices located on different processing elements. We employ algorithm-aware coarsening of messages sent during vertex visitation, reducing both the number of messages and the absolute amount of communication in the system. To achieve this, the system structure is represented by a hierarchical graph, facilitating communication optimizations that can take into consideration the machine's memory hierarchy. We also present an optimization for small-world scale-free graphs wherein hub vertices (i.e., vertices of very large degree) are represented in a similar hierarchical manner, which is exploited to increase parallelism and reduce communication. Finally, we present a framework that transparently allows fine-grained graph algorithms to utilize our hierarchical approach without programmer intervention, while improving scalability and performance. Experimental results of our proposed approach on 131,000+ cores show improvements of up to a factor of 8 over the non-hierarchical version for various graph mining and graph analytics algorithms.
{"title":"An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms","authors":"Harshvardhan, Adam Fidel, N. Amato, Lawrence Rauchwerger","doi":"10.1109/PACT.2015.34","DOIUrl":"https://doi.org/10.1109/PACT.2015.34","url":null,"abstract":"Graph algorithms on distributed-memory systems typically perform heavy communication, often limiting their scalability and performance. This work presents an approach to transparently (without programmer intervention) allow fine-grained graph algorithms to utilize algorithmic communication reduction optimizations. In many graph algorithms, the same information is communicated by a vertex to its neighbors, which we coin algorithmic redundancy. Our approach exploits algorithmic redundancy to reduce communication between vertices located on different processing elements. We employ algorithm-aware coarsening of messages sent during vertex visitation, reducing both the number of messages and the absolute amount of communication in the system. To achieve this, the system structure is represented by a hierarchical graph, facilitating communication optimizations that can take into consideration the machine's memory hierarchy. We also present an optimization for small-world scale-free graphs wherein hub vertices (i.e., vertices of very large degree) are represented in a similar hierarchical manner, which is exploited to increase parallelism and reduce communication. Finally, we present a framework that transparently allows fine-grained graph algorithms to utilize our hierarchical approach without programmer intervention, while improving scalability and performance. Experimental results of our proposed approach on 131,000+ cores show improvements of up to a factor of 8 times over the non-hierarchical version for various graph mining and graph analytics algorithms.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133105447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}