2015 International Conference on Parallel Architecture and Compilation (PACT): Latest Papers (Page 5)

Scalable Task Scheduling and Synchronization Using Hierarchical Effects

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.25

Stephen Heumann, Alexandros Tzannes, Vikram S. Adve

Several concurrent programming models that give strong safety guarantees employ effect specifications that indicate what effects on shared state a piece of code may perform. These specifications can be much more expressive than traditional synchronization mechanisms like locks, and they are amenable to static and/or dynamic checking approaches for ensuring safety properties. The Tasks With Effects (TWE) programming model uses dynamic checking to give nearly the strongest safety guarantees of any existing shared memory language while providing the flexibility to express both structured and unstructured concurrency. Like several other systems, TWE's effect specifications use hierarchical memory regions, which can naturally model nested and modular data structures and allow effects to be expressed at different levels of granularity in different parts of a program. To implement a programming model like TWE with high performance, particularly for programs with many fine-grain tasks, the run-time task scheduler must employ an algorithm that can enforce task isolation (mutual exclusion of tasks with conflicting effects) with low overhead and high scalability. This paper describes such an algorithm for TWE. It uses a scheduling tree designed to take advantage of the hierarchical structure of TWE effects, obtaining two key properties that lead to high scalability: (a) effects need to be compared only for ancestor and descendant nodes in the tree, and not any other nodes, and (b) the scheduler can use fine-grain locking of tree nodes to enable highly concurrent scheduling operations. We prove formally that the algorithm guarantees task isolation. Experimental results with a range of programs show that the algorithm provides very good scalability, even with fine-grain tasks.
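Property (a) above can be illustrated with a small sketch. It assumes a simplified model that is not taken from the paper: a region is a path from the root of the region tree, an effect on a region is treated as covering that region's whole subtree, and two effects conflict only when their regions lie on one root-to-leaf path and at least one of them is a write. The names `Effect`, `is_ancestor_or_same`, and `conflicts` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Effect:
    """Hypothetical effect: a read or write on a hierarchical region path."""
    region: Tuple[str, ...]  # path from the root, e.g. ("Root", "Tree", "Left")
    writes: bool


def is_ancestor_or_same(a: Tuple[str, ...], b: Tuple[str, ...]) -> bool:
    """True if region path a is a prefix of (or equal to) region path b."""
    return len(a) <= len(b) and b[:len(a)] == a


def conflicts(e1: Effect, e2: Effect) -> bool:
    """Simplified assumption: effects conflict only if their regions are in an
    ancestor/descendant relationship and at least one effect is a write."""
    related = (is_ancestor_or_same(e1.region, e2.region)
               or is_ancestor_or_same(e2.region, e1.region))
    return related and (e1.writes or e2.writes)
```

Under this model, effects on sibling regions such as `("Root", "Tree", "Left")` and `("Root", "Tree", "Right")` never need to be compared, which is the intuition behind restricting comparisons to ancestor and descendant nodes of a scheduling tree.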

Citations: 1

RC3: Consistency Directed Cache Coherence for x86-64 with RC Extensions

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.37

M. Elver, V. Nagarajan

The recent convergence towards programming language based memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. In effect, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86-64) cannot readily make use of lazy coherence protocols without either: changing the architecture's consistency model to (a variant of) RC at the expense of backwards compatibility, or adapting the protocol to satisfy the stricter consistency model, thereby failing to benefit from synchronization information. We show an approach for the x86-64 architecture, which is a compromise between the two. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backwards compatibility with legacy codes and older microarchitectures. Second, we propose RC3, a scalable hardware cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers, and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC optimized programs. RC3's storage overhead per cache line scales logarithmically with increasing core count, and reduces on-chip coherence storage overheads by 45% compared to a related approach specifically targeting TSO.

Citations: 9

Phase Aware Warp Scheduling: Mitigating Effects of Phase Behavior in GPGPU Applications

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.31

Mihir Awatramani, Xian Zhu, Joseph Zambreno, D. Rover

Graphics Processing Units (GPUs) have been widely adopted as accelerators for high performance computing due to the immense amount of computational throughput they offer over their CPU counterparts. As GPU architectures are optimized for throughput, they execute a large number of SIMD threads (warps) in parallel and use hardware multithreading to hide the pipeline and memory access latencies. While the Two-Level Round Robin (TLRR) and Greedy Then Oldest (GTO) warp scheduling policies have been widely accepted in the academic research community, there is no consensus regarding which policy works best for all applications. In this paper, we show that the disparity regarding which scheduling policy works better depends on the characteristics of instructions in different regions (phases) of the application. We identify these phases at compile time and design a novel warp scheduling policy that uses information regarding them to make scheduling decisions at runtime. By mitigating the adverse effects of application phase behavior, our policy always performs closer to the better of the two existing policies for each application. We evaluate the performance of the warp schedulers on 35 kernels from the Rodinia and CUDA SDK benchmark suites. For applications that have a better performance with the GTO scheduler, our warp scheduler matches the performance of GTO with 99.2% accuracy and achieves an average speedup of 6.31% over RR. Similarly, for applications that perform better with RR, the performance of our scheduler is within 98% of RR and achieves an average speedup of 6.65% over GTO.
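The two baseline policies and a phase-driven switch between them can be sketched as follows. This is an illustrative simplification, not the paper's scheduler: plain round robin stands in for two-level round robin, and the `phase_policy` tag (assumed to come from a compile-time phase annotation) and all function names are hypothetical.

```python
def pick_gto(ready, last_issued, age):
    """Greedy Then Oldest: stay on the last-issued warp while it is ready,
    otherwise fall back to the oldest ready warp (smallest age value)."""
    if last_issued in ready:
        return last_issued
    return min(ready, key=lambda w: age[w])


def pick_rr(ready, last_issued, num_warps):
    """Round robin: next ready warp after the last-issued one, in circular order."""
    start = -1 if last_issued is None else last_issued
    for step in range(1, num_warps + 1):
        candidate = (start + step) % num_warps
        if candidate in ready:
            return candidate
    return None


def pick_phase_aware(ready, last_issued, age, num_warps, phase_policy):
    """Hypothetical switch: the current phase carries a tag ("GTO" or "RR")
    that selects which baseline policy makes this scheduling decision."""
    if not ready:
        return None
    if phase_policy == "GTO":
        return pick_gto(ready, last_issued, age)
    return pick_rr(ready, last_issued, num_warps)
```

With ready warps `{1, 3}` and warp 1 issued last, the GTO branch keeps issuing from warp 1, while the round-robin branch moves on to warp 3.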

Citations: 13

AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.35

Muneeb Khan, M. Laurenzano, Jason Mars, Erik Hagersten, D. Black-Schaffer

Modern processors widely use hardware prefetching to hide memory latency. While aggressive hardware prefetchers can improve performance significantly for some applications, they can limit the overall performance in highly-utilized multicore processors by saturating the offchip bandwidth and wasting last-level cache capacity. Co-executing applications can slow down due to contention over these shared resources. This work introduces Adaptive Resource Efficient Prefetching (AREP) -- a runtime framework that dynamically combines software prefetching and hardware prefetching to maximize throughput in highly utilized multicore processors. AREP achieves better performance by prefetching data in a resource efficient way -- conserving offchip-bandwidth and last-level cache capacity with accurate prefetching and by applying cache-bypassing when possible. AREP dynamically explores a mix of hardware/software prefetching policies, then selects and applies the best performing policy. AREP is phase-aware and re-explores (at runtime) for the best prefetching policy at phase boundaries. A multitude of experiments with workload mixes and parallel applications on a modern high performance multicore show that AREP can increase throughput by up to 49% (8.1% on average). This is complemented by improved fairness, resulting in average quality of service above 94%.
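The explore-then-select loop described above can be sketched in a few lines. This is only an outline under assumptions: the `measure` callback (a short trial run returning observed throughput for a phase and policy), the policy names passed in by the caller, and both function names are hypothetical and not part of AREP itself.

```python
def select_policy(policies, measure):
    """Try each candidate policy once and keep the best-performing one.
    `measure(policy)` is an assumed callback returning observed throughput."""
    return max(policies, key=measure)


def run_phases(phases, policies, measure):
    """Re-explore at every phase boundary and record the policy chosen
    for each phase. `measure(phase, policy)` is an assumed callback."""
    chosen = []
    for phase in phases:
        best = select_policy(policies, lambda p: measure(phase, p))
        chosen.append((phase, best))
    return chosen
```

A real runtime would also have to bound the cost of the trial intervals themselves; the sketch ignores that overhead.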

Citations: 22

Unified Identification of Multiple Forms of Parallelism in Embedded Applications

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.53

M. Aguilar, R. Leupers

The use of Multiprocessor Systems on Chip (MPSoCs) is a common practice in the design of state-of-the-art embedded devices, as MPSoCs provide a good trade-off between performance, energy and cost. However, programming MPSoCs is a challenging task, which currently involves multiple manual steps. Although several research efforts have addressed this challenge, there is not yet a widely accepted solution. In this work, we describe an approach to automatically extract multiple forms of parallelism from sequential embedded applications in a unified manner. We evaluate the applicability of our work by parallelizing multiple embedded applications on two commercial platforms.

Citations: 5

Stadium Hashing: Scalable and Flexible Hashing on GPUs

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.13

Farzad Khorasani, M. Belviranli, Rajiv Gupta, L. Bhuyan

Hashing is one of the most fundamental operations that provides a means for a program to obtain fast access to large amounts of data. Despite the emergence of GPUs as many-threaded general purpose processors, high performance parallel data hashing solutions for GPUs are yet to receive adequate attention. Existing hashing solutions for GPUs not only impose restrictions (e.g., inability to concurrently execute insertion and retrieval operations, limitation on the size of key-value data pairs) that limit their applicability, but their performance also does not scale to large hash tables that must be kept out-of-core in the host memory. In this paper we present Stadium Hashing (Stash), which is scalable to large hash tables and practical as it does not impose the aforementioned restrictions. To support large out-of-core hash tables, Stash uses a compact data structure named ticket-board that is separate from hash table buckets and is held inside GPU global memory. Ticket-board locally resolves a significant portion of insertion and lookup operations and hence, by reducing accesses to the host memory, it accelerates the execution of these operations. The split design of the ticket-board also enables arbitrarily large keys and values. Unlike existing methods, Stash naturally supports concurrent insertions and retrievals due to its use of double hashing as the collision resolution strategy. Furthermore, we propose Stash with collaborative lanes (clStash) that enhances GPU's SIMD resource utilization for batched insertions during hash table creation. For concurrent insertion and retrieval streams, Stadium hashing can be up to 2 and 3 times faster than GPU Cuckoo hashing for in-core and out-of-core tables respectively.
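Double hashing, the collision resolution strategy named above, can be shown with a sequential sketch. This is a generic textbook version and not Stash itself: it has no ticket-board, no GPU parallelism, and no deletion support. The class name, the two hash functions, and the restriction to non-negative integer keys are assumptions made for illustration.

```python
class DoubleHashTable:
    """Minimal open-addressing table using double hashing (sequential sketch)."""

    def __init__(self, capacity=11):
        # A prime capacity makes every step size coprime with the table size,
        # so each probe sequence visits every slot exactly once.
        self.capacity = capacity
        self.slots = [None] * capacity

    def _probe(self, key):
        """Yield slot indices (h1 + i * h2) mod capacity for i = 0, 1, ..."""
        h1 = key % self.capacity
        h2 = 1 + (key // self.capacity) % (self.capacity - 1)  # never zero
        for i in range(self.capacity):
            yield (h1 + i * h2) % self.capacity

    def insert(self, key, value):
        """Store the pair; return False only if the table is full."""
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is None or slot[0] == key:
                self.slots[idx] = (key, value)
                return True
        return False

    def lookup(self, key):
        """Return the stored value, or None if the key is absent."""
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is None:
                return None
            if slot[0] == key:
                return slot[1]
        return None
```

Keys 3, 14, and 25 all start at slot 3 in an 11-slot table but use step sizes 1, 2, and 3, so after the first collision they spread to different slots instead of clustering.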

Citations: 32

TSXProf: Profiling Hardware Transactions

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.28

Yujie Liu, Justin Emile Gottschlich, Gilles A. Pokam, Michael F. Spear

The availability of commercial hardware transactional memory (TM) systems has not yet been met with a rise in the number of large-scale programs that use memory transactions explicitly. A significant impediment to the use of TM is the lack of tool support, specifically profilers that can identify and explain performance anomalies. In this paper, we introduce an end-to-end system that enables low-overhead performance profiling of large-scale transactional programs. We present algorithms and an implementation for Intel's Haswell processors. With our system, it is possible to record a transactional program's execution with minimal overhead, and then replay it within a custom profiling tool to identify causes of contention and aborts, down to the granularity of individual memory accesses. Evaluation shows that our algorithms have low overhead, and our tools enable programmers to effectively explain performance anomalies.

Citations: 9

Polyhedral Optimizations of Explicitly Parallel Programs

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.44

Prasanth Chatarasi, J. Shirako, Vivek Sarkar

The polyhedral model is a powerful algebraic framework that has enabled significant advances to analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral frameworks to optimize explicitly parallel programs. An interesting side effect of supporting explicitly parallel programs is that doing so can also enable optimization of programs with unanalyzable data accesses within a polyhedral framework. In this paper, we address the problem of extending polyhedral frameworks to enable analysis and transformation of programs that contain both explicit parallelism and unanalyzable data accesses. As a first step, we focus on OpenMP loop parallelism and task parallelism, including task dependences from OpenMP 4.0. Our approach first enables conservative dependence analysis of a given region of code. Next, we identify happens-before relations from the explicitly parallel constructs, such as tasks and parallel loops, and intersect them with the conservative dependences. Finally, the resulting set of dependences is passed on to a polyhedral optimizer, such as PLuTo and PolyAST, to enable transformation of explicitly parallel programs with unanalyzable data accesses. We evaluate our approach using eleven OpenMP benchmark programs from the KASTORS and Rodinia benchmark suites. We show that 1) these benchmarks contain unanalyzable data accesses that prevent polyhedral frameworks from performing exact dependence analysis, 2) explicit parallelism can help mitigate the imprecision, and 3) polyhedral transformations with the resulting dependences can further improve the performance of the manually-parallelized OpenMP benchmarks. Our experimental results show geometric mean performance improvements of 1.62x and 2.75x on the Intel Westmere and IBM Power8 platforms respectively (relative to the original OpenMP versions).
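The intersection step described above can be sketched on a toy example. The sketch models dependences and happens-before relations as finite sets of (source, sink) statement pairs rather than polyhedral relations, and all names are hypothetical; it only illustrates why intersecting the two sets removes spurious dependences.

```python
def transitive_closure(edges):
    """Close a set of (before, after) pairs under transitivity."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def refine_dependences(conservative, happens_before_edges):
    """Keep only the conservative dependences whose endpoints are ordered by
    happens-before. Pairs that may run in parallel are dropped, which is
    sound only under the assumption that the input program is race-free."""
    return set(conservative) & transitive_closure(happens_before_edges)
```

For two tasks S1 and S2 followed by a synchronization point and then S3, the happens-before edges are S1→S3 and S2→S3. A conservative analysis that cannot analyze the accesses might also report S1→S2, and the intersection discards that pair.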

Citations: 27

Extending Polyhedral Model for Analysis and Transformation of OpenMP Programs

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.57

Prasanth Chatarasi, Vivek Sarkar

The polyhedral model is a powerful algebraic framework that has enabled significant advances in analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral compilation techniques to analyze and transform explicitly parallel programs. In our PACT'15 paper titled "Polyhedral Optimizations of Explicitly Parallel Programs" [1, 2], we addressed the problem of analyzing and transforming programs with explicit parallelism that satisfy the serial-elision property, i.e., the property that removal of all parallel constructs results in a sequential program that is a valid (albeit inefficient) implementation of the parallel program semantics. In this poster, we address the problem of analyzing and transforming more general OpenMP programs that do not satisfy the serial-elision property. Our contributions include the following: 1) An extension of the polyhedral model to represent input OpenMP programs, 2) Formalization of May-Happen-in-Parallel (MHP) and Happens-Before (HB) relations in the extended model, 3) An approach for static detection of data races in OpenMP programs by generating race constraints that can be solved by an SMT solver such as Z3, and 4) An approach for transforming OpenMP programs.
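As a rough illustration of what a race constraint for a parallel loop might look like, the sketch below checks whether two distinct iterations can touch the same array element with one of them writing. It is not the authors' method: the abstract describes handing constraints to an SMT solver such as Z3, whereas this sketch swaps in brute-force enumeration over a small iteration space to stay self-contained. The function name `find_race` and the index functions are hypothetical, and the sketch assumes that all distinct iterations of the loop may happen in parallel and that only write/read pairs need checking.

```python
from itertools import product

def find_race(n, write_index, read_index):
    """Search a parallel loop over range(n) for a write/read race.

    Returns a pair (i, j) of distinct iterations where iteration i writes
    the element that iteration j reads, or None if no such pair exists.
    Hypothetical helper; enumeration stands in for an SMT query.
    """
    for i, j in product(range(n), repeat=2):
        # Race constraint: the two iterations are distinct (assumed to be
        # able to run in parallel) and they access the same element.
        if i != j and write_index(i) == read_index(j):
            return (i, j)
    return None

# Loop body a[i] = a[i + 1]: iteration 1 writes a[1], iteration 0 reads a[1].
racy = find_race(8, lambda i: i, lambda j: j + 1)
# Loop body a[i] = a[i] * 2: each iteration touches only its own element.
safe = find_race(8, lambda i: i, lambda j: j)
print(racy, safe)
```

A solver-based version would state the same condition symbolically over the loop bounds instead of enumerating them, which is what makes the approach applicable to parametric iteration spaces.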

Citations: 3

ALEA: Fine-Grain Energy Profiling with Basic Block Sampling

2015 International Conference on Parallel Architecture and Compilation (PACT)

Pub Date : 2015-04-03 DOI: 10.1109/PACT.2015.16

L. Mukhanov, Dimitrios S. Nikolopoulos, B. D. Supinski

Energy efficiency is an essential requirement for all contemporary computing systems. We thus need tools to measure the energy consumption of computing systems and to understand how workloads affect it. Significant recent research effort has targeted direct power measurements on production computing systems using on-board sensors or external instruments. These direct methods have in turn guided studies of software techniques to reduce energy consumption via workload allocation and scaling. Unfortunately, direct energy measurements are hampered by the low power sampling frequency of power sensors. The coarse granularity of power sensing limits our understanding of how power is allocated in systems and our ability to optimize energy efficiency via workload allocation. We present ALEA, a tool to measure power and energy consumption at the granularity of basic blocks, using a probabilistic approach. ALEA provides fine-grained energy profiling via statistical sampling, which overcomes the limitations of power sensing instruments. Compared to state-of-the-art energy measurement tools, ALEA provides finer granularity without sacrificing accuracy. ALEA achieves low-overhead energy measurements with mean error rates between 1.4% and 3.5% in 14 sequential and parallel benchmarks tested on both Intel and ARM platforms. The sampling method caps execution time overhead at approximately 1%. ALEA is thus suitable for online energy monitoring and optimization. Finally, ALEA is a user-space tool with a portable, machine-independent sampling method. We demonstrate three use cases of ALEA, where we reduce the energy consumption of a k-means computational kernel by 37%, an ocean modeling code by 33%, and a ray tracing code by 6% compared to high-performance execution baselines, by varying the power optimization strategy between basic blocks.
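The statistical idea behind sampling-based energy attribution can be sketched as follows. This is one plausible reading of the abstract, not ALEA's actual implementation: it assumes each sample records which basic block was executing at a random instant together with a power reading, estimates a block's execution time from its share of the samples, and multiplies that time by the block's mean sampled power. The function name `estimate_block_energy`, the block labels, and the sample values are all hypothetical.

```python
from collections import defaultdict

def estimate_block_energy(samples, total_time_s):
    """Attribute time and energy to basic blocks from random samples.

    samples: list of (block_id, power_watts) pairs taken at random instants.
    total_time_s: measured wall-clock time of the whole run, in seconds.
    Returns {block_id: (estimated_time_s, estimated_energy_joules)}.
    """
    count = defaultdict(int)
    power_sum = defaultdict(float)
    for block, watts in samples:
        count[block] += 1
        power_sum[block] += watts

    n = len(samples)
    profile = {}
    for block in count:
        # Fraction of samples landing in a block approximates its time share.
        est_time = total_time_s * count[block] / n
        mean_power = power_sum[block] / count[block]
        profile[block] = (est_time, mean_power * est_time)
    return profile

# Made-up samples: three hits in a compute loop at 20 W, one in I/O at 10 W.
samples = [("bb_loop", 20.0)] * 3 + [("bb_io", 10.0)]
profile = estimate_block_energy(samples, total_time_s=8.0)
print(profile)
```

Because the estimate depends only on sample counts and readings, its accuracy improves with the number of samples rather than with the sensor's sampling frequency, which is the limitation the abstract says direct measurement runs into.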

Citations: 27