Stephen Heumann, Alexandros Tzannes, Vikram S. Adve
Several concurrent programming models that give strong safety guarantees employ effect specifications that indicate what effects on shared state a piece of code may perform. These specifications can be much more expressive than traditional synchronization mechanisms like locks, and they are amenable to static and/or dynamic checking approaches for ensuring safety properties. The Tasks With Effects (TWE) programming model uses dynamic checking to give nearly the strongest safety guarantees of any existing shared memory language while providing the flexibility to express both structured and unstructured concurrency. Like several other systems, TWE's effect specifications use hierarchical memory regions, which can naturally model nested and modular data structures and allow effects to be expressed at different levels of granularity in different parts of a program. To implement a programming model like TWE with high performance, particularly for programs with many fine-grain tasks, the run-time task scheduler must employ an algorithm that can enforce task isolation (mutual exclusion of tasks with conflicting effects) with low overhead and high scalability. This paper describes such an algorithm for TWE. It uses a scheduling tree designed to take advantage of the hierarchical structure of TWE effects, obtaining two key properties that lead to high scalability: (a) effects need to be compared only for ancestor and descendant nodes in the tree, and not any other nodes, and (b) the scheduler can use fine-grain locking of tree nodes to enable highly concurrent scheduling operations. We prove formally that the algorithm guarantees task isolation. Experimental results with a range of programs show that the algorithm provides very good scalability, even with fine-grain tasks.
{"title":"Scalable Task Scheduling and Synchronization Using Hierarchical Effects","authors":"Stephen Heumann, Alexandros Tzannes, Vikram S. Adve","doi":"10.1109/PACT.2015.25","DOIUrl":"https://doi.org/10.1109/PACT.2015.25","url":null,"abstract":"Several concurrent programming models that give strong safety guarantees employ effect specifications that indicate what effects on shared state a piece of code may perform. These specifications can be much more expressive than traditional synchronization mechanisms like locks, and they are amenable to static and/or dynamic checking approaches for ensuring safety properties. The Tasks With Effects (TWE) programming model uses dynamic checking to give nearly the strongest safety guarantees of any existing shared memory language while providing the flexibility to express both structured and unstructured concurrency. Like several other systems, TWE's effect specifications use hierarchical memory regions, which can naturally model nested and modular data structures and allow effects to be expressed at different levels of granularity in different parts of a program. To implement a programming model like TWE with high performance, particularly for programs with many fine-grain tasks, the run-time task scheduler must employ an algorithm that can enforce task isolation (mutual exclusion of tasks with conflicting effects) with low overhead and high scalability. This paper describes such an algorithm for TWE. It uses a scheduling tree designed to take advantage of the hierarchical structure of TWE effects, obtaining two key properties that lead to high scalability: (a) effects need to be compared only for ancestor and descendant nodes in the tree, and not any other nodes, and (b) the scheduler can use fine-grain locking of tree nodes to enable highly concurrent scheduling operations. We prove formally that the algorithm guarantees task isolation. Experimental results with a range of programs show that the algorithm provides very good scalability, even with fine-grain tasks.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115207718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The recent convergence towards programming language based memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. In effect, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86-64) cannot readily make use of lazy coherence protocols without either: changing the architecture's consistency model to (a variant of) RC at the expense of backwards compatibility, or adapting the protocol to satisfy the stricter consistency model, thereby failing to benefit from synchronization information. We show an approach for the x86-64 architecture, which is a compromise between the two. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backwards compatibility with legacy codes and older microarchitectures. Second, we propose RC3, a scalable hardware cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers, and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC optimized programs. RC3's storage overhead per cache line scales logarithmically with increasing core count, and reduces on-chip coherence storage overheads by 45% compared to a related approach specifically targeting TSO.
{"title":"RC3: Consistency Directed Cache Coherence for x86-64 with RC Extensions","authors":"M. Elver, V. Nagarajan","doi":"10.1109/PACT.2015.37","DOIUrl":"https://doi.org/10.1109/PACT.2015.37","url":null,"abstract":"The recent convergence towards programming language based memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. In effect, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86-64) cannot readily make use of lazy coherence protocols without either: changing the architecture's consistency model to (a variant of) RC at the expense of backwards compatibility, or adapting the protocol to satisfy the stricter consistency model, thereby failing to benefit from synchronization information. We show an approach for the x86-64 architecture, which is a compromise between the two. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backwards compatibility with legacy codes and older microarchitectures. Second, we propose RC3, a scalable hardware cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers, and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC optimized programs. RC3's storage overhead per cache line scales logarithmically with increasing core count, and reduces on-chip coherence storage overheads by 45% compared to a related approach specifically targeting TSO.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130625980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mihir Awatramani, Xian Zhu, Joseph Zambreno, D. Rover
Graphics Processing Units (GPUs) have been widely adopted as accelerators for high performance computing due to the immense amount of computational throughput they offer over their CPU counterparts. As GPU architectures are optimized for throughput, they execute a large number of SIMD threads (warps) in parallel and use hardware multithreading to hide the pipeline and memory access latencies. While the Two-Level Round Robin (TLRR) and Greedy Then Oldest (GTO) warp scheduling policies have been widely accepted in the academic research community, there is no consensus regarding which policy works best for all applications. In this paper, we show that the disparity regarding which scheduling policy works better depends on the characteristics of instructions in different regions (phases) of the application. We identify these phases at compile time and design a novel warp scheduling policy that uses information regarding them to make scheduling decisions at runtime. By mitigating the adverse effects of application phase behavior, our policy always performs closer to the better of the two existing policies for each application. We evaluate the performance of the warp schedulers on 35 kernels from the Rodinia and CUDA SDK benchmark suites. For applications that have a better performance with the GTO scheduler, our warp scheduler matches the performance of GTO with 99.2% accuracy and achieves an average speedup of 6.31% over RR. Similarly, for applications that perform better with RR, the performance of our scheduler is within of 98% of RR and achieves an average speedup of 6.65% over GTO.
{"title":"Phase Aware Warp Scheduling: Mitigating Effects of Phase Behavior in GPGPU Applications","authors":"Mihir Awatramani, Xian Zhu, Joseph Zambreno, D. Rover","doi":"10.1109/PACT.2015.31","DOIUrl":"https://doi.org/10.1109/PACT.2015.31","url":null,"abstract":"Graphics Processing Units (GPUs) have been widely adopted as accelerators for high performance computing due to the immense amount of computational throughput they offer over their CPU counterparts. As GPU architectures are optimized for throughput, they execute a large number of SIMD threads (warps) in parallel and use hardware multithreading to hide the pipeline and memory access latencies. While the Two-Level Round Robin (TLRR) and Greedy Then Oldest (GTO) warp scheduling policies have been widely accepted in the academic research community, there is no consensus regarding which policy works best for all applications. In this paper, we show that the disparity regarding which scheduling policy works better depends on the characteristics of instructions in different regions (phases) of the application. We identify these phases at compile time and design a novel warp scheduling policy that uses information regarding them to make scheduling decisions at runtime. By mitigating the adverse effects of application phase behavior, our policy always performs closer to the better of the two existing policies for each application. We evaluate the performance of the warp schedulers on 35 kernels from the Rodinia and CUDA SDK benchmark suites. For applications that have a better performance with the GTO scheduler, our warp scheduler matches the performance of GTO with 99.2% accuracy and achieves an average speedup of 6.31% over RR. Similarly, for applications that perform better with RR, the performance of our scheduler is within of 98% of RR and achieves an average speedup of 6.65% over GTO.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117298431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Muneeb Khan, M. Laurenzano, Jason Mars, Erik Hagersten, D. Black-Schaffer
Modern processors widely use hardware prefetching to hide memory latency. While aggressive hardware prefetchers can improve performance significantly for some applications, they can limit the overall performance in highly-utilized multicore processors by saturating the offchip bandwidth and wasting last-level cache capacity. Co-executing applications can slowdown due to contention over these shared resources. This work introduces Adaptive Resource Efficient Prefetching (AREP) -- a runtime framework that dynamically combines software prefetching and hardware prefetching to maximize throughput in highly utilized multicore processors. AREP achieves better performance by prefetching data in a resource efficient way -- conserving offchip-bandwidth and last-level cache capacity with accurate prefetching and by applying cache-bypassing when possible. AREP dynamically explores a mix of hardware/software prefetching policies, then selects and applies the best performing policy. AREP is phase-aware and re-explores (at runtime) for the best prefetching policy at phase boundaries. A multitude of experiments with workload mixes and parallel applications on a modern high performance multicore show that AREP can increase throughput by up to 49% (8.1% on average). This is complemented by improved fairness, resulting in average quality of service above 94%.
{"title":"AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance","authors":"Muneeb Khan, M. Laurenzano, Jason Mars, Erik Hagersten, D. Black-Schaffer","doi":"10.1109/PACT.2015.35","DOIUrl":"https://doi.org/10.1109/PACT.2015.35","url":null,"abstract":"Modern processors widely use hardware prefetching to hide memory latency. While aggressive hardware prefetchers can improve performance significantly for some applications, they can limit the overall performance in highly-utilized multicore processors by saturating the offchip bandwidth and wasting last-level cache capacity. Co-executing applications can slowdown due to contention over these shared resources. This work introduces Adaptive Resource Efficient Prefetching (AREP) -- a runtime framework that dynamically combines software prefetching and hardware prefetching to maximize throughput in highly utilized multicore processors. AREP achieves better performance by prefetching data in a resource efficient way -- conserving offchip-bandwidth and last-level cache capacity with accurate prefetching and by applying cache-bypassing when possible. AREP dynamically explores a mix of hardware/software prefetching policies, then selects and applies the best performing policy. AREP is phase-aware and re-explores (at runtime) for the best prefetching policy at phase boundaries. A multitude of experiments with workload mixes and parallel applications on a modern high performance multicore show that AREP can increase throughput by up to 49% (8.1% on average). This is complemented by improved fairness, resulting in average quality of service above 94%.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114758577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The use of Multiprocessor Systems on Chip (MPSoCs) is a common practice in the design of state-of-the-art embedded devices, as MPSoCs provide a good trade-off between performance, energy and cost. However, programming MPSoCs is a challenging task, which currently involves multiple manual steps. Although, several research efforts have addressed this challenge, there is not yet a widely accepted solution. In this work, we describe an approach to automatically extract multiple forms of parallelism from sequential embedded applications in a unified manner. We evaluate the applicability of our work by parallelizing multiple embedded applications on two commercial platforms.
{"title":"Unified Identification of Multiple Forms of Parallelism in Embedded Applications","authors":"M. Aguilar, R. Leupers","doi":"10.1109/PACT.2015.53","DOIUrl":"https://doi.org/10.1109/PACT.2015.53","url":null,"abstract":"The use of Multiprocessor Systems on Chip (MPSoCs) is a common practice in the design of state-of-the-art embedded devices, as MPSoCs provide a good trade-off between performance, energy and cost. However, programming MPSoCs is a challenging task, which currently involves multiple manual steps. Although, several research efforts have addressed this challenge, there is not yet a widely accepted solution. In this work, we describe an approach to automatically extract multiple forms of parallelism from sequential embedded applications in a unified manner. We evaluate the applicability of our work by parallelizing multiple embedded applications on two commercial platforms.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121453305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Farzad Khorasani, M. Belviranli, Rajiv Gupta, L. Bhuyan
Hashing is one of the most fundamental operations that provides a means for a program to obtain fast access to large amounts of data. Despite the emergence of GPUs as many-threaded general purpose processors, high performance parallel data hashing solutions for GPUs are yet to receive adequate attention. Existing hashing solutions for GPUs not only impose restrictions (e.g., inability to concurrently execute insertion and retrieval operations, limitation on the size of key-value data pairs) that limit their applicability, their performance does not scale to large hash tables that must be kept out-of-core in the host memory. In this paper we present Stadium Hashing (Stash) that is scalable to large hash tables and practical as it does not impose the aforementioned restrictions. To support large out-of-core hash tables, Stash uses a compact data structure named ticket-board that is separate from hash table buckets and is held inside GPU global memory. Ticket-board locally resolves significant portion of insertion and lookup operations and hence, by reducing accesses to the host memory, it accelerates the execution of these operations. Split design of the ticket-board also enables arbitrarily large keys and values. Unlike existing methods, Stash naturally supports concurrent insertions and retrievals due to its use of double hashing as the collision resolution strategy. Furthermore, we propose Stash with collaborative lanes (clStash) that enhances GPU's SIMD resource utilization for batched insertions during hash table creation. For concurrent insertion and retrieval streams, Stadium hashing can be up to 2 and 3 times faster than GPU Cuckoo hashing for in-core and out-of-core tables respectively.
{"title":"Stadium Hashing: Scalable and Flexible Hashing on GPUs","authors":"Farzad Khorasani, M. Belviranli, Rajiv Gupta, L. Bhuyan","doi":"10.1109/PACT.2015.13","DOIUrl":"https://doi.org/10.1109/PACT.2015.13","url":null,"abstract":"Hashing is one of the most fundamental operations that provides a means for a program to obtain fast access to large amounts of data. Despite the emergence of GPUs as many-threaded general purpose processors, high performance parallel data hashing solutions for GPUs are yet to receive adequate attention. Existing hashing solutions for GPUs not only impose restrictions (e.g., inability to concurrently execute insertion and retrieval operations, limitation on the size of key-value data pairs) that limit their applicability, their performance does not scale to large hash tables that must be kept out-of-core in the host memory. In this paper we present Stadium Hashing (Stash) that is scalable to large hash tables and practical as it does not impose the aforementioned restrictions. To support large out-of-core hash tables, Stash uses a compact data structure named ticket-board that is separate from hash table buckets and is held inside GPU global memory. Ticket-board locally resolves significant portion of insertion and lookup operations and hence, by reducing accesses to the host memory, it accelerates the execution of these operations. Split design of the ticket-board also enables arbitrarily large keys and values. Unlike existing methods, Stash naturally supports concurrent insertions and retrievals due to its use of double hashing as the collision resolution strategy. Furthermore, we propose Stash with collaborative lanes (clStash) that enhances GPU's SIMD resource utilization for batched insertions during hash table creation. For concurrent insertion and retrieval streams, Stadium hashing can be up to 2 and 3 times faster than GPU Cuckoo hashing for in-core and out-of-core tables respectively.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131862264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yujie Liu, Justin Emile Gottschlich, Gilles A. Pokam, Michael F. Spear
The availability of commercial hardware transactionalmemory (TM) systems has not yet been met with a rise in the numberof large-scale programs that use memory transactions explicitly. Asignificant impediment to the use of TM is the lack of tool support, specifically profilers that can identify and explain performance anomalies. In this paper, we introduce an end-to-end system that enables lowoverheadperformance profiling of large-scale transactional programs. We present algorithms and an implementation for Intel's Haswellprocessors. With our system, it is possible to record a transactionalprogram's execution with minimal overhead, and then replay it withina custom profiling tool to identify causes of contention and aborts, down to the granularity of individual memory accesses. Evaluationshows that our algorithms have low overhead, and our tools enableprogrammers to effectively explain performance anomalies.
{"title":"TSXProf: Profiling Hardware Transactions","authors":"Yujie Liu, Justin Emile Gottschlich, Gilles A. Pokam, Michael F. Spear","doi":"10.1109/PACT.2015.28","DOIUrl":"https://doi.org/10.1109/PACT.2015.28","url":null,"abstract":"The availability of commercial hardware transactionalmemory (TM) systems has not yet been met with a rise in the numberof large-scale programs that use memory transactions explicitly. Asignificant impediment to the use of TM is the lack of tool support, specifically profilers that can identify and explain performance anomalies. In this paper, we introduce an end-to-end system that enables lowoverheadperformance profiling of large-scale transactional programs. We present algorithms and an implementation for Intel's Haswellprocessors. With our system, it is possible to record a transactionalprogram's execution with minimal overhead, and then replay it withina custom profiling tool to identify causes of contention and aborts, down to the granularity of individual memory accesses. Evaluationshows that our algorithms have low overhead, and our tools enableprogrammers to effectively explain performance anomalies.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129809011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The polyhedral model is a powerful algebraic framework that has enabled significant advances to analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral frameworks to optimize explicitly parallel programs. An interesting side effect of supporting explicitly parallel programs is that doing so can also enable optimization of programs with unanalyzable data accesses within a polyhedral framework. In this paper, we address the problem of extending polyhedral frameworks to enable analysis and transformation of programs that contain both explicit parallelism and unanalyzable data accesses. As a first step, we focus on OpenMP loop parallelism and task parallelism, including task dependences from OpenMP 4.0. Our approach first enables conservative dependence analysis of a given region of code. Next, we identify happens-before relations from the explicitly parallel constructs, such as tasks and parallel loops, and intersect them with the conservative dependences. Finally, the resulting set of dependences is passed on to a polyhedral optimizer, such as PLuTo and PolyAST, to enable transformation of explicitly parallel programs with unanalyzable data accesses. We evaluate our approach using eleven OpenMP benchmark programs from the KASTORS and Rodinia benchmark suites. We show that 1) these benchmarks contain unanalyzable data accesses that prevent polyhedral frameworks from performing exact dependence analysis, 2) explicit parallelism can help mitigate the imprecision, and 3) polyhedral transformations with the resulting dependences can further improve the performance of the manually-parallelized OpenMP benchmarks. Our experimental results show geometric mean performance improvements of 1.62x and 2.75x on the Intel Westmere and IBM Power8 platforms respectively (relative to the original OpenMP versions).
{"title":"Polyhedral Optimizations of Explicitly Parallel Programs","authors":"Prasanth Chatarasi, J. Shirako, Vivek Sarkar","doi":"10.1109/PACT.2015.44","DOIUrl":"https://doi.org/10.1109/PACT.2015.44","url":null,"abstract":"The polyhedral model is a powerful algebraic framework that has enabled significant advances to analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral frameworks to optimize explicitly parallel programs. An interesting side effect of supporting explicitly parallel programs is that doing so can also enable optimization of programs with unanalyzable data accesses within a polyhedral framework. In this paper, we address the problem of extending polyhedral frameworks to enable analysis and transformation of programs that contain both explicit parallelism and unanalyzable data accesses. As a first step, we focus on OpenMP loop parallelism and task parallelism, including task dependences from OpenMP 4.0. Our approach first enables conservative dependence analysis of a given region of code. Next, we identify happens-before relations from the explicitly parallel constructs, such as tasks and parallel loops, and intersect them with the conservative dependences. Finally, the resulting set of dependences is passed on to a polyhedral optimizer, such as PLuTo and PolyAST, to enable transformation of explicitly parallel programs with unanalyzable data accesses. We evaluate our approach using eleven OpenMP benchmark programs from the KASTORS and Rodinia benchmark suites. We show that 1) these benchmarks contain unanalyzable data accesses that prevent polyhedral frameworks from performing exact dependence analysis, 2) explicit parallelism can help mitigate the imprecision, and 3) polyhedral transformations with the resulting dependences can further improve the performance of the manually-parallelized OpenMP benchmarks. Our experimental results show geometric mean performance improvements of 1.62x and 2.75x on the Intel Westmere and IBM Power8 platforms respectively (relative to the original OpenMP versions).","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126217427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The polyhedral model is a powerful algebraic framework that has enabled significant advances in analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral compilation techniques to analyze and transform explicitly parallel programs. In our PACT'15 paper titled "Polyhedral Optimizations of Explicitly Parallel Programs" [1, 2], we addressed the problem of analyzing and transforming programs with explicit parallelism that satisfy the serial-elision property, i.e., the property that removal of all parallel constructs results in a sequential program that is a valid (albeit inefficient) implementation of the parallel program semantics.In this poster, we address the problem of analyzing and transforming more general OpenMP programs that do not satisfy the serial-elision property. Our contributions include the following: 1) An extension of the polyhedral model to represent input OpenMP programs, 2) Formalization of May Happen in Parallel (MHP) and Happens before (HB) relations in the extended model, 3) An approach for static detection of data races in OpenMP programs by generating race constraints that can be solved by an SMT solver such as Z3, and 4) An approach for transforming OpenMP programs.
多面体模型是一个强大的代数框架,与传统的基于ast的方法相比,它在序列仿射(子)程序的分析和转换方面取得了重大进展。然而,鉴于并行软件的快速发展,需要更多地关注使用多面体编译技术来显式分析和转换并行程序。在我们的PACT'15论文题为“显式并行程序的多面体优化”[1,2]中,我们解决了分析和转换具有显式并行性的程序的问题,这些程序满足串行省略属性,即删除所有并行构造的属性导致顺序程序是并行程序语义的有效(尽管效率低下)实现。在这张海报中,我们解决了分析和转换不满足串行省略属性的更通用的OpenMP程序的问题。我们的贡献包括:1)扩展多面体模型来表示输入OpenMP程序,2)形式化扩展模型中的May Happen in Parallel (MHP)和Happens before (HB)关系,3)通过生成可由SMT求解器(如Z3)求解的竞争约束,静态检测OpenMP程序中的数据竞争的方法,以及4)转换OpenMP程序的方法。
{"title":"Extending Polyhedral Model for Analysis and Transformation of OpenMP Programs","authors":"Prasanth Chatarasi, Vivek Sarkar","doi":"10.1109/PACT.2015.57","DOIUrl":"https://doi.org/10.1109/PACT.2015.57","url":null,"abstract":"The polyhedral model is a powerful algebraic framework that has enabled significant advances in analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral compilation techniques to analyze and transform explicitly parallel programs. In our PACT'15 paper titled \"Polyhedral Optimizations of Explicitly Parallel Programs\" [1, 2], we addressed the problem of analyzing and transforming programs with explicit parallelism that satisfy the serial-elision property, i.e., the property that removal of all parallel constructs results in a sequential program that is a valid (albeit inefficient) implementation of the parallel program semantics.In this poster, we address the problem of analyzing and transforming more general OpenMP programs that do not satisfy the serial-elision property. Our contributions include the following: 1) An extension of the polyhedral model to represent input OpenMP programs, 2) Formalization of May Happen in Parallel (MHP) and Happens before (HB) relations in the extended model, 3) An approach for static detection of data races in OpenMP programs by generating race constraints that can be solved by an SMT solver such as Z3, and 4) An approach for transforming OpenMP programs.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125033807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Mukhanov, Dimitrios S. Nikolopoulos, B. D. Supinski
Energy efficiency is an essential requirement for all contemporary computing systems. We thus need tools to measure the energy consumption of computing systems and to understand how workloads affect it. Significant recent research effort has targeted direct power measurements on production computing systems using on-board sensors or external instruments. These direct methods have in turn guided studies of software techniques to reduce energy consumption via workload allocation and scaling. Unfortunately, direct energymeasurementsarehamperedbythelowpowersampling frequency of power sensors. The coarse granularity of power sensing limits our understanding of how power is allocated in systems and our ability to optimize energy efficiency via workload allocation. We present ALEA, a tool to measure power and energy consumption at the granularity of basic blocks, using a probabilistic approach. ALEA provides fine-grained energy profiling via statistical sampling, which overcomes the limitations of power sensing instruments. Compared to state-of-the-art energy measurement tools, ALEA provides finer granularity without sacrificing accuracy. ALEA achieves low overhead energy measurements with mean error rates between 1.4% and 3.5% in 14 sequential and parallel benchmarks tested on both Intel and ARM platforms. The sampling method caps execution time overhead at approximately 1%. ALEA is thus suitable for online energy monitoring and optimization. Finally, ALEA is a user-space tool with a portable, machine-independent sampling method. We demonstrate three use cases of ALEA, where we reduce the energy consumption of a k-means computational kernel by 37%, an ocean modeling code by 33%, and a ray tracing code by 6% compared to high-performance execution baselines, by varying the power optimization strategy between basic blocks.
{"title":"ALEA: Fine-Grain Energy Profiling with Basic Block Sampling","authors":"L. Mukhanov, Dimitrios S. Nikolopoulos, B. D. Supinski","doi":"10.1109/PACT.2015.16","DOIUrl":"https://doi.org/10.1109/PACT.2015.16","url":null,"abstract":"Energy efficiency is an essential requirement for all contemporary computing systems. We thus need tools to measure the energy consumption of computing systems and to understand how workloads affect it. Significant recent research effort has targeted direct power measurements on production computing systems using on-board sensors or external instruments. These direct methods have in turn guided studies of software techniques to reduce energy consumption via workload allocation and scaling. Unfortunately, direct energymeasurementsarehamperedbythelowpowersampling frequency of power sensors. The coarse granularity of power sensing limits our understanding of how power is allocated in systems and our ability to optimize energy efficiency via workload allocation. We present ALEA, a tool to measure power and energy consumption at the granularity of basic blocks, using a probabilistic approach. ALEA provides fine-grained energy profiling via statistical sampling, which overcomes the limitations of power sensing instruments. Compared to state-of-the-art energy measurement tools, ALEA provides finer granularity without sacrificing accuracy. ALEA achieves low overhead energy measurements with mean error rates between 1.4% and 3.5% in 14 sequential and parallel benchmarks tested on both Intel and ARM platforms. The sampling method caps execution time overhead at approximately 1%. ALEA is thus suitable for online energy monitoring and optimization. Finally, ALEA is a user-space tool with a portable, machine-independent sampling method. We demonstrate three use cases of ALEA, where we reduce the energy consumption of a k-means computational kernel by 37%, an ocean modeling code by 33%, and a ray tracing code by 6% compared to high-performance execution baselines, by varying the power optimization strategy between basic blocks.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121253618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}