Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the entire memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging because they typically spawn threads within Single Program Multiple Data (SPMD) models. Since these threads have similar resource requirements, their contention cannot be easily mitigated through simple thread scheduling. To address this important issue, we first vigorously conduct a systematic performance evaluation on a wide-range of representative HPC and CMP applications on three mainstream SMT architectures, and quantify their performance sensitivity to SMT effects. Then we introduce a simple scheme for SMT-aware code optimization which aims to reduce the memory contention across SMT threads. Finally, we develop a lightweight performance tool, named SMTAnalyzer, to effectively identify the optimization opportunities in the source code of multithreaded programs. Experiments on three SMT architectures (i.e., Intel Xeon, IBM POWER7, and Intel Xeon Phi) demonstrate that our proposed SMT-aware optimization scheme can significantly improve the performance for general HPC applications.
{"title":"SMT-Aware Instantaneous Footprint Optimization","authors":"Probir Roy, Xu Liu, S. Song","doi":"10.1145/2907294.2907308","DOIUrl":"https://doi.org/10.1145/2907294.2907308","url":null,"abstract":"Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the entire memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging because they typically spawn threads within Single Program Multiple Data (SPMD) models. Since these threads have similar resource requirements, their contention cannot be easily mitigated through simple thread scheduling. To address this important issue, we first vigorously conduct a systematic performance evaluation on a wide-range of representative HPC and CMP applications on three mainstream SMT architectures, and quantify their performance sensitivity to SMT effects. Then we introduce a simple scheme for SMT-aware code optimization which aims to reduce the memory contention across SMT threads. Finally, we develop a lightweight performance tool, named SMTAnalyzer, to effectively identify the optimization opportunities in the source code of multithreaded programs. Experiments on three SMT architectures (i.e., Intel Xeon, IBM POWER7, and Intel Xeon Phi) demonstrate that our proposed SMT-aware optimization scheme can significantly improve the performance for general HPC applications.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"2 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91419059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: High Performance Networks","authors":"P. Balaji","doi":"10.1145/3257969","DOIUrl":"https://doi.org/10.1145/3257969","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74968796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-05-31DOI: 10.1007/978-3-319-42432-3_1
J. Dongarra
{"title":"With Extreme Scale Computing the Rules Have Changed","authors":"J. Dongarra","doi":"10.1007/978-3-319-42432-3_1","DOIUrl":"https://doi.org/10.1007/978-3-319-42432-3_1","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80690988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panruo Wu, Dong Li, Zizhong Chen, J. Vetter, Sparsh Mittal
The emergence of many non-volatile memory (NVM) techniques is poised to revolutionize main memory systems because of the relatively high capacity and low lifetime power consumption of NVM. However, to avoid the typical limitation of NVM as the main memory, NVM is usually combined with DRAM to form a hybrid NVM/DRAM system to gain the benefits of each. However, this integrated memory system raises a question on how to manage data placement and movement across NVM and DRAM, which is critical for maximizing the benefits of this integration. The existing solutions have several limitations, which obstruct adoption of these solutions in the high performance computing (HPC) domain. In particular, they cannot take advantage of application semantics, thus losing critical optimization opportunities and demanding extensive hardware extensions; they implement persistent semantics for resilience purpose while suffering large performance and energy overhead. In this paper, we re-examine the current hybrid memory designs from the HPC perspective, and aim to leverage the knowledge of numerical algorithms to direct data placement. With explicit algorithm management and limited hardware support, we optimize data movement between NVM and DRAM, improve data locality, and implement a relaxed memory persistency scheme in NVM. Our work demonstrates significant benefits of integrating algorithm knowledge into the hybrid memory design to achieve multi-dimensional optimization (performance, energy, and resilience) in HPC.
{"title":"Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory","authors":"Panruo Wu, Dong Li, Zizhong Chen, J. Vetter, Sparsh Mittal","doi":"10.1145/2907294.2907321","DOIUrl":"https://doi.org/10.1145/2907294.2907321","url":null,"abstract":"The emergence of many non-volatile memory (NVM) techniques is poised to revolutionize main memory systems because of the relatively high capacity and low lifetime power consumption of NVM. However, to avoid the typical limitation of NVM as the main memory, NVM is usually combined with DRAM to form a hybrid NVM/DRAM system to gain the benefits of each. However, this integrated memory system raises a question on how to manage data placement and movement across NVM and DRAM, which is critical for maximizing the benefits of this integration. The existing solutions have several limitations, which obstruct adoption of these solutions in the high performance computing (HPC) domain. In particular, they cannot take advantage of application semantics, thus losing critical optimization opportunities and demanding extensive hardware extensions; they implement persistent semantics for resilience purpose while suffering large performance and energy overhead. In this paper, we re-examine the current hybrid memory designs from the HPC perspective, and aim to leverage the knowledge of numerical algorithms to direct data placement. With explicit algorithm management and limited hardware support, we optimize data movement between NVM and DRAM, improve data locality, and implement a relaxed memory persistency scheme in NVM. Our work demonstrates significant benefits of integrating algorithm knowledge into the hybrid memory design to achieve multi-dimensional optimization (performance, energy, and resilience) in HPC.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80991788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Next generation datacenter and exascale machines will include significantly larger amounts of memory, greater heterogeneity in the performance, persistence or sharing properties of the memory components they encompass, and increase in the relative cost and complexity of the data paths in the resulting memory topology. This poses several challenges to the systems software stacks managing these memory-centric platform designs. First, technology advances in novel memory technologies shift the data access bottlenecks into the software stack. Second, current systems software lacks capabilities to bridge the multi-dimensional non-uniformity in the memory subsystem to the dynamic nature of the workloads it must support. In addition, current memory management solutions have limited ability to explicitly reason about the costs and tradeoffs associated with data movement operations, leading to limited efficiency of their interconnect use. To address these problems, next generation systems software stacks require new data structures, abstractions and mechanisms in order to enable new levels of efficiency in the data placement, movement, and transformation decisions that govern the underlying memory use. In this talk, I will present our approach to rearchitecting systems software and services in response to both node-level and system-wide memory heterogeneity and scale, particularly concerning the presence of non-volatile memories, and will demonstrate the resulting performance and efficiency gains using several scientific and data-intensive workloads.
{"title":"Implications of Heterogeneous Memories in Next Generation Server Systems","authors":"Ada Gavrilovska","doi":"10.1145/2907294.2911993","DOIUrl":"https://doi.org/10.1145/2907294.2911993","url":null,"abstract":"Next generation datacenter and exascale machines will include significantly larger amounts of memory, greater heterogeneity in the performance, persistence or sharing properties of the memory components they encompass, and increase in the relative cost and complexity of the data paths in the resulting memory topology. This poses several challenges to the systems software stacks managing these memory-centric platform designs. First, technology advances in novel memory technologies shift the data access bottlenecks into the software stack. Second, current systems software lacks capabilities to bridge the multi-dimensional non-uniformity in the memory subsystem to the dynamic nature of the workloads it must support. In addition, current memory management solutions have limited ability to explicitly reason about the costs and tradeoffs associated with data movement operations, leading to limited efficiency of their interconnect use. To address these problems, next generation systems software stacks require new data structures, abstractions and mechanisms in order to enable new levels of efficiency in the data placement, movement, and transformation decisions that govern the underlying memory use. In this talk, I will present our approach to rearchitecting systems software and services in response to both node-level and system-wide memory heterogeneity and scale, particularly concerning the presence of non-volatile memories, and will demonstrate the resulting performance and efficiency gains using several scientific and data-intensive workloads.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81842240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many techniques have been proposed to provide, transparently, the abstraction of a layer-2 virtual network environment within a provider, e.g. by leveraging Software-Defined Networking (SDN). However, cloud providers often constrain layer-2 communication across instances; furthermore, SDN integration and layer-2 messaging between distinct domains distributed across the Internet is not possible, hindering the ability for tenants to deploy their virtual networks across providers. In contrast, overlay networks provide a flexible foundation for inter-cloud virtual private networking (VPN), by tunneling virtual network traffic through private, authenticated end-to-end overlay links. However, overlays inherently incur network virtualization overheads, including header encapsulation and user/kernel boundary crossing. This paper proposes a novel system -- VIAS (VIrtualization Acceleration over SDN) -- that delivers the flexibility of overlays for inter-cloud virtual private networking, while transparently applying SDN techniques (available in existing OpenFlow hardware or software switches) to selectively bypass overlay tunneling and achieve near-native performance for TCP/UDP flows within a provider. Architecturally, VIAS is unique in how it integrates SDN and overlay controllers in a distributed fashion to coordinate the management of virtual network links and flows. The approach is self-organizing, whereby overlay nodes can detect that peer endpoints are in the same network and program bypass flows between OpenFlow switches. While generally applicable, VIAS in particular applies to nested VMs/containers across cloud providers, supporting seamless communication within and across providers. VIAS has been implemented as an extension to an existing virtual network overlay platform (IP-over-P2P, IPOP) by integrating OpenFlow controller functionality with distributed overlay controllers. We evaluate the performance of VIAS in realistic cloud environments using an implementation based on IPOP, the RYU SDN framework, Open vSwitch, and LXC containers across various cloud environment including Amazon, Google compute engine, and CloudLab.
{"title":"Self-configuring Software-defined Overlay Bypass for Seamless Inter- and Intra-cloud Virtual Networking","authors":"Kyu-Young Jeong, R. Figueiredo","doi":"10.1145/2907294.2907318","DOIUrl":"https://doi.org/10.1145/2907294.2907318","url":null,"abstract":"Many techniques have been proposed to provide, transparently, the abstraction of a layer-2 virtual network environment within a provider, e.g. by leveraging Software-Defined Networking (SDN). However, cloud providers often constrain layer-2 communication across instances; furthermore, SDN integration and layer-2 messaging between distinct domains distributed across the Internet is not possible, hindering the ability for tenants to deploy their virtual networks across providers. In contrast, overlay networks provide a flexible foundation for inter-cloud virtual private networking (VPN), by tunneling virtual network traffic through private, authenticated end-to-end overlay links. However, overlays inherently incur network virtualization overheads, including header encapsulation and user/kernel boundary crossing. This paper proposes a novel system -- VIAS (VIrtualization Acceleration over SDN) -- that delivers the flexibility of overlays for inter-cloud virtual private networking, while transparently applying SDN techniques (available in existing OpenFlow hardware or software switches) to selectively bypass overlay tunneling and achieve near-native performance for TCP/UDP flows within a provider. Architecturally, VIAS is unique in how it integrates SDN and overlay controllers in a distributed fashion to coordinate the management of virtual network links and flows. The approach is self-organizing, whereby overlay nodes can detect that peer endpoints are in the same network and program bypass flows between OpenFlow switches. While generally applicable, VIAS in particular applies to nested VMs/containers across cloud providers, supporting seamless communication within and across providers. VIAS has been implemented as an extension to an existing virtual network overlay platform (IP-over-P2P, IPOP) by integrating OpenFlow controller functionality with distributed overlay controllers. We evaluate the performance of VIAS in realistic cloud environments using an implementation based on IPOP, the RYU SDN framework, Open vSwitch, and LXC containers across various cloud environment including Amazon, Google compute engine, and CloudLab.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"55 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85405468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Cloud and Resource Management","authors":"Ming Zhao","doi":"10.1145/3257974","DOIUrl":"https://doi.org/10.1145/3257974","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"4 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91417381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The field of data analytics is currently going through a renaissance as a result of ever-increasing dataset sizes, the value of the models that can be trained from those datasets, and a surge in flexible, distributed programming models. In particular, the Apache Hadoop and Spark programming systems, as well as their supporting projects (e.g. HDFS, SparkSQL), have greatly simplified the analysis and transformation of datasets whose size exceeds the capacity of a single machine. While these programming models facilitate the use of distributed systems to analyze large datasets, they have been plagued by performance issues. The I/O performance bottlenecks of Hadoop are partially responsible for the creation of Spark. Performance bottlenecks in Spark due to the JVM object model, garbage collection, interpreted/managed execution, and other abstraction layers are responsible for the creation of additional optimization layers, such as Project Tungsten. Indeed, the Project Tungsten issue tracker states that the "majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory". In this work, we address the CPU and memory performance bottlenecks that exist in Apache Spark by accelerating user-written computational kernels using accelerators. We refer to our approach as Spark With Accelerated Tasks (SWAT). SWAT is an accelerated data analytics (ADA) framework that enables programmers to natively execute Spark applications on high performance hardware platforms with co-processors, while continuing to write their applications in a JVM-based language like Java or Scala. Runtime code generation creates OpenCL kernels from JVM bytecode, which are then executed on OpenCL accelerators. In our work we emphasize 1) full compatibility with a modern, existing, and accepted data analytics platform, 2) an asynchronous, event-driven, and resource-aware runtime, 3) multi-GPU memory management and caching, and 4) ease-of-use and programmability. Our performance evaluation demonstrates up to 3.24x overall application speedup relative to Spark across six machine learning benchmarks, with a detailed investigation of these performance improvements.
数据分析领域目前正在经历一场复兴,因为数据集规模不断增加,可以从这些数据集中训练的模型的价值,以及灵活的分布式编程模型的激增。特别是,Apache Hadoop和Spark编程系统,以及它们的支持项目(例如HDFS, SparkSQL),极大地简化了数据集的分析和转换,这些数据集的大小超过了单个机器的容量。虽然这些编程模型有助于使用分布式系统分析大型数据集,但它们一直受到性能问题的困扰。Hadoop的I/O性能瓶颈是创建Spark的部分原因。由于JVM对象模型、垃圾收集、解释/托管执行和其他抽象层,Spark中的性能瓶颈负责创建额外的优化层,例如Project Tungsten。事实上,Project Tungsten问题跟踪器指出,“大多数Spark工作负载的瓶颈不是I/O或网络,而是CPU和内存”。在这项工作中,我们通过使用加速器加速用户编写的计算内核来解决Apache Spark中存在的CPU和内存性能瓶颈。我们将我们的方法称为Spark With Accelerated Tasks (SWAT)。SWAT是一个加速数据分析(ADA)框架,它使程序员能够在带有协处理器的高性能硬件平台上本地执行Spark应用程序,同时继续使用基于jvm的语言(如Java或Scala)编写应用程序。运行时代码生成从JVM字节码创建OpenCL内核,然后在OpenCL加速器上执行。在我们的工作中,我们强调1)与现代的、现有的和公认的数据分析平台的完全兼容性,2)异步的、事件驱动的和资源感知的运行时,3)多gpu内存管理和缓存,以及4)易用性和可编程性。我们的性能评估显示,在六个机器学习基准测试中,相对于Spark,应用程序的整体速度提高了3.24倍,并对这些性能改进进行了详细的调查。
{"title":"SWAT: A Programmable, In-Memory, Distributed, High-Performance Computing Platform","authors":"M. Grossman, Vivek Sarkar","doi":"10.1145/2907294.2907307","DOIUrl":"https://doi.org/10.1145/2907294.2907307","url":null,"abstract":"The field of data analytics is currently going through a renaissance as a result of ever-increasing dataset sizes, the value of the models that can be trained from those datasets, and a surge in flexible, distributed programming models. In particular, the Apache Hadoop and Spark programming systems, as well as their supporting projects (e.g. HDFS, SparkSQL), have greatly simplified the analysis and transformation of datasets whose size exceeds the capacity of a single machine. While these programming models facilitate the use of distributed systems to analyze large datasets, they have been plagued by performance issues. The I/O performance bottlenecks of Hadoop are partially responsible for the creation of Spark. Performance bottlenecks in Spark due to the JVM object model, garbage collection, interpreted/managed execution, and other abstraction layers are responsible for the creation of additional optimization layers, such as Project Tungsten. Indeed, the Project Tungsten issue tracker states that the \"majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory\". In this work, we address the CPU and memory performance bottlenecks that exist in Apache Spark by accelerating user-written computational kernels using accelerators. We refer to our approach as Spark With Accelerated Tasks (SWAT). SWAT is an accelerated data analytics (ADA) framework that enables programmers to natively execute Spark applications on high performance hardware platforms with co-processors, while continuing to write their applications in a JVM-based language like Java or Scala. Runtime code generation creates OpenCL kernels from JVM bytecode, which are then executed on OpenCL accelerators. In our work we emphasize 1) full compatibility with a modern, existing, and accepted data analytics platform, 2) an asynchronous, event-driven, and resource-aware runtime, 3) multi-GPU memory management and caching, and 4) ease-of-use and programmability. Our performance evaluation demonstrates up to 3.24x overall application speedup relative to Spark across six machine learning benchmarks, with a detailed investigation of these performance improvements.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75880588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote Address","authors":"Jack Lange","doi":"10.1145/3257968","DOIUrl":"https://doi.org/10.1145/3257968","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76229097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose IMPACC, an MPI+OpenACC framework for heterogeneous accelerator clusters. IMPACC tightly integrates MPI and OpenACC, while exploiting the shared memory parallelism in the target system. IMPACC dynamically adapts the input MPI+OpenACC applications on the target heterogeneous accelerator clusters to fully exploit target system-specific features. IMPACC provides the programmers with the unified virtual address space, automatic NUMA-friendly task-device mapping, efficient integrated communication routines, seamless streamlining of asynchronous executions, and transparent memory sharing. We have implemented IMPACC and evaluated its performance using three heterogeneous accelerator systems, including Titan supercomputer. Results show that IMPACC can achieve easier programming, higher performance, and better scalability than the current MPI+OpenACC model.
{"title":"IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism","authors":"Jungwon Kim, Seyong Lee, J. Vetter","doi":"10.1145/2907294.2907302","DOIUrl":"https://doi.org/10.1145/2907294.2907302","url":null,"abstract":"We propose IMPACC, an MPI+OpenACC framework for heterogeneous accelerator clusters. IMPACC tightly integrates MPI and OpenACC, while exploiting the shared memory parallelism in the target system. IMPACC dynamically adapts the input MPI+OpenACC applications on the target heterogeneous accelerator clusters to fully exploit target system-specific features. IMPACC provides the programmers with the unified virtual address space, automatic NUMA-friendly task-device mapping, efficient integrated communication routines, seamless streamlining of asynchronous executions, and transparent memory sharing. We have implemented IMPACC and evaluated its performance using three heterogeneous accelerator systems, including Titan supercomputer. Results show that IMPACC can achieve easier programming, higher performance, and better scalability than the current MPI+OpenACC model.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"89 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80388217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}