
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing: Latest Publications

SMT-Aware Instantaneous Footprint Optimization
Probir Roy, Xu Liu, S. Song
Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the entire memory hierarchy of a physical core. Without careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging because they typically spawn threads within Single Program Multiple Data (SPMD) models. Since these threads have similar resource requirements, their contention cannot be easily mitigated through simple thread scheduling. To address this important issue, we first rigorously conduct a systematic performance evaluation on a wide range of representative HPC and CMP applications on three mainstream SMT architectures, and quantify their performance sensitivity to SMT effects. Then we introduce a simple scheme for SMT-aware code optimization which aims to reduce the memory contention across SMT threads. Finally, we develop a lightweight performance tool, named SMTAnalyzer, to effectively identify the optimization opportunities in the source code of multithreaded programs. Experiments on three SMT architectures (i.e., Intel Xeon, IBM POWER7, and Intel Xeon Phi) demonstrate that our proposed SMT-aware optimization scheme can significantly improve the performance of general HPC applications.
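The abstract does not reproduce SMTAnalyzer's transformations, so the following OpenMP sketch is only a rough illustration of one footprint-oriented change of the general kind described: picking a chunked static schedule so that threads likely to be co-scheduled as SMT siblings touch adjacent tiles at the same instant, bounding their combined working set in the core-private caches they share. The array size and CHUNK value are assumptions chosen purely for illustration.

```c
/*
 * Illustrative sketch only: not the paper's SMTAnalyzer or its exact
 * transformation.  schedule(static, CHUNK) hands tiles out round-robin,
 * so threads 0 and 1 (likely SMT siblings on one core) process
 * neighbouring tiles concurrently rather than regions N/2 elements
 * apart, shrinking the instantaneous footprint in the shared caches.
 * Build: cc -O2 -fopenmp smt_footprint.c
 */
#include <stdio.h>
#include <stdlib.h>

#define N     (1L << 24)
#define CHUNK 2048            /* tile assumed to fit a shared L2 slice */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (long i = 0; i < N; i++) a[i] = (double)i;

    #pragma omp parallel for schedule(static, CHUNK)
    for (long i = 0; i < N; i++)
        b[i] = 0.5 * (a[i] + (i + 1 < N ? a[i + 1] : a[i]));

    printf("b[1] = %f\n", b[1]);
    free(a); free(b);
    return 0;
}
```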
{"title":"SMT-Aware Instantaneous Footprint Optimization","authors":"Probir Roy, Xu Liu, S. Song","doi":"10.1145/2907294.2907308","DOIUrl":"https://doi.org/10.1145/2907294.2907308","url":null,"abstract":"Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the entire memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging because they typically spawn threads within Single Program Multiple Data (SPMD) models. Since these threads have similar resource requirements, their contention cannot be easily mitigated through simple thread scheduling. To address this important issue, we first vigorously conduct a systematic performance evaluation on a wide-range of representative HPC and CMP applications on three mainstream SMT architectures, and quantify their performance sensitivity to SMT effects. Then we introduce a simple scheme for SMT-aware code optimization which aims to reduce the memory contention across SMT threads. Finally, we develop a lightweight performance tool, named SMTAnalyzer, to effectively identify the optimization opportunities in the source code of multithreaded programs. Experiments on three SMT architectures (i.e., Intel Xeon, IBM POWER7, and Intel Xeon Phi) demonstrate that our proposed SMT-aware optimization scheme can significantly improve the performance for general HPC applications.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"2 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91419059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Session details: High Performance Networks
P. Balaji
{"title":"Session details: High Performance Networks","authors":"P. Balaji","doi":"10.1145/3257969","DOIUrl":"https://doi.org/10.1145/3257969","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74968796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
With Extreme Scale Computing the Rules Have Changed
J. Dongarra
{"title":"With Extreme Scale Computing the Rules Have Changed","authors":"J. Dongarra","doi":"10.1007/978-3-319-42432-3_1","DOIUrl":"https://doi.org/10.1007/978-3-319-42432-3_1","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80690988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory
Panruo Wu, Dong Li, Zizhong Chen, J. Vetter, Sparsh Mittal
The emergence of many non-volatile memory (NVM) techniques is poised to revolutionize main memory systems because of the relatively high capacity and low lifetime power consumption of NVM. However, to avoid the typical limitations of NVM as main memory, NVM is usually combined with DRAM to form a hybrid NVM/DRAM system that gains the benefits of each. This integrated memory system raises the question of how to manage data placement and movement across NVM and DRAM, which is critical for maximizing the benefits of the integration. The existing solutions have several limitations that obstruct their adoption in the high performance computing (HPC) domain. In particular, they cannot take advantage of application semantics, thus losing critical optimization opportunities and demanding extensive hardware extensions; and they implement persistent semantics for resilience purposes while suffering large performance and energy overheads. In this paper, we re-examine current hybrid memory designs from the HPC perspective, and aim to leverage the knowledge of numerical algorithms to direct data placement. With explicit algorithm management and limited hardware support, we optimize data movement between NVM and DRAM, improve data locality, and implement a relaxed memory persistency scheme in NVM. Our work demonstrates significant benefits of integrating algorithm knowledge into hybrid memory design to achieve multi-dimensional optimization (performance, energy, and resilience) in HPC.
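The paper's actual API and runtime are not given in the abstract; the sketch below only illustrates the general idea of algorithm-directed placement under stated assumptions. `dram_alloc` and `nvm_alloc` are hypothetical stand-ins (here just `malloc` wrappers so the sketch runs) for whatever hybrid-memory allocator the platform provides, and the Jacobi sweep is a generic example of an algorithm whose access pattern, a large read-mostly matrix versus small hot iterate vectors, can direct placement.

```c
/*
 * Sketch of algorithm-directed placement with HYPOTHETICAL allocators:
 * dram_alloc()/nvm_alloc() stand in for a real hybrid-memory API and
 * simply wrap malloc here so the example is self-contained.
 */
#include <stdlib.h>
#include <string.h>

static void *dram_alloc(size_t n) { return malloc(n); } /* hypothetical */
static void *nvm_alloc(size_t n)  { return malloc(n); } /* hypothetical */

/* One simplified Jacobi-style sweep (unit diagonal, zero right-hand
 * side): the matrix A is large and read-mostly, so an algorithm-directed
 * policy places it in NVM; the iterate vectors are small and rewritten
 * every sweep, so they go to DRAM.                                     */
void jacobi_sweep(long n, const double *A, const double *x, double *x_new)
{
    for (long i = 0; i < n; i++) {
        double s = 0.0;
        for (long j = 0; j < n; j++)
            if (j != i) s += A[i * n + j] * x[j];
        x_new[i] = -s;
    }
}

int main(void)
{
    long n = 512;
    double *A     = nvm_alloc(n * n * sizeof *A);  /* cold, read-mostly */
    double *x     = dram_alloc(n * sizeof *x);     /* hot, rewritten    */
    double *x_new = dram_alloc(n * sizeof *x_new);
    memset(A, 0, n * n * sizeof *A);
    for (long i = 0; i < n; i++) x[i] = 1.0;
    jacobi_sweep(n, A, x, x_new);
    free(A); free(x); free(x_new);
    return 0;
}
```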
{"title":"Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory","authors":"Panruo Wu, Dong Li, Zizhong Chen, J. Vetter, Sparsh Mittal","doi":"10.1145/2907294.2907321","DOIUrl":"https://doi.org/10.1145/2907294.2907321","url":null,"abstract":"The emergence of many non-volatile memory (NVM) techniques is poised to revolutionize main memory systems because of the relatively high capacity and low lifetime power consumption of NVM. However, to avoid the typical limitation of NVM as the main memory, NVM is usually combined with DRAM to form a hybrid NVM/DRAM system to gain the benefits of each. However, this integrated memory system raises a question on how to manage data placement and movement across NVM and DRAM, which is critical for maximizing the benefits of this integration. The existing solutions have several limitations, which obstruct adoption of these solutions in the high performance computing (HPC) domain. In particular, they cannot take advantage of application semantics, thus losing critical optimization opportunities and demanding extensive hardware extensions; they implement persistent semantics for resilience purpose while suffering large performance and energy overhead. In this paper, we re-examine the current hybrid memory designs from the HPC perspective, and aim to leverage the knowledge of numerical algorithms to direct data placement. With explicit algorithm management and limited hardware support, we optimize data movement between NVM and DRAM, improve data locality, and implement a relaxed memory persistency scheme in NVM. Our work demonstrates significant benefits of integrating algorithm knowledge into the hybrid memory design to achieve multi-dimensional optimization (performance, energy, and resilience) in HPC.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80991788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
Implications of Heterogeneous Memories in Next Generation Server Systems
Ada Gavrilovska
Next generation datacenter and exascale machines will include significantly larger amounts of memory, greater heterogeneity in the performance, persistence, or sharing properties of the memory components they encompass, and an increase in the relative cost and complexity of the data paths in the resulting memory topology. This poses several challenges to the systems software stacks managing these memory-centric platform designs. First, technology advances in novel memory technologies shift the data access bottlenecks into the software stack. Second, current systems software lacks the capabilities to bridge the multi-dimensional non-uniformity in the memory subsystem to the dynamic nature of the workloads it must support. In addition, current memory management solutions have limited ability to explicitly reason about the costs and tradeoffs associated with data movement operations, leading to limited efficiency in their interconnect use. To address these problems, next generation systems software stacks require new data structures, abstractions, and mechanisms to enable new levels of efficiency in the data placement, movement, and transformation decisions that govern the underlying memory use. In this talk, I will present our approach to rearchitecting systems software and services in response to both node-level and system-wide memory heterogeneity and scale, particularly concerning the presence of non-volatile memories, and will demonstrate the resulting performance and efficiency gains using several scientific and data-intensive workloads.
{"title":"Implications of Heterogeneous Memories in Next Generation Server Systems","authors":"Ada Gavrilovska","doi":"10.1145/2907294.2911993","DOIUrl":"https://doi.org/10.1145/2907294.2911993","url":null,"abstract":"Next generation datacenter and exascale machines will include significantly larger amounts of memory, greater heterogeneity in the performance, persistence or sharing properties of the memory components they encompass, and increase in the relative cost and complexity of the data paths in the resulting memory topology. This poses several challenges to the systems software stacks managing these memory-centric platform designs. First, technology advances in novel memory technologies shift the data access bottlenecks into the software stack. Second, current systems software lacks capabilities to bridge the multi-dimensional non-uniformity in the memory subsystem to the dynamic nature of the workloads it must support. In addition, current memory management solutions have limited ability to explicitly reason about the costs and tradeoffs associated with data movement operations, leading to limited efficiency of their interconnect use. To address these problems, next generation systems software stacks require new data structures, abstractions and mechanisms in order to enable new levels of efficiency in the data placement, movement, and transformation decisions that govern the underlying memory use. In this talk, I will present our approach to rearchitecting systems software and services in response to both node-level and system-wide memory heterogeneity and scale, particularly concerning the presence of non-volatile memories, and will demonstrate the resulting performance and efficiency gains using several scientific and data-intensive workloads.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81842240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Self-configuring Software-defined Overlay Bypass for Seamless Inter- and Intra-cloud Virtual Networking
Kyu-Young Jeong, R. Figueiredo
Many techniques have been proposed to provide, transparently, the abstraction of a layer-2 virtual network environment within a provider, e.g. by leveraging Software-Defined Networking (SDN). However, cloud providers often constrain layer-2 communication across instances; furthermore, SDN integration and layer-2 messaging between distinct domains distributed across the Internet are not possible, hindering the ability for tenants to deploy their virtual networks across providers. In contrast, overlay networks provide a flexible foundation for inter-cloud virtual private networking (VPN), by tunneling virtual network traffic through private, authenticated end-to-end overlay links. However, overlays inherently incur network virtualization overheads, including header encapsulation and user/kernel boundary crossing. This paper proposes a novel system -- VIAS (VIrtualization Acceleration over SDN) -- that delivers the flexibility of overlays for inter-cloud virtual private networking, while transparently applying SDN techniques (available in existing OpenFlow hardware or software switches) to selectively bypass overlay tunneling and achieve near-native performance for TCP/UDP flows within a provider. Architecturally, VIAS is unique in how it integrates SDN and overlay controllers in a distributed fashion to coordinate the management of virtual network links and flows. The approach is self-organizing, whereby overlay nodes can detect that peer endpoints are in the same network and program bypass flows between OpenFlow switches. While generally applicable, VIAS in particular applies to nested VMs/containers across cloud providers, supporting seamless communication within and across providers. VIAS has been implemented as an extension to an existing virtual network overlay platform (IP-over-P2P, IPOP) by integrating OpenFlow controller functionality with distributed overlay controllers. We evaluate the performance of VIAS in realistic cloud environments using an implementation based on IPOP, the RYU SDN framework, Open vSwitch, and LXC containers across various cloud environments, including Amazon, Google Compute Engine, and CloudLab.
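VIAS's controller logic is not reproduced here; as a minimal sketch of the detection step the abstract mentions, where overlay nodes notice that two peer endpoints sit in the same network before a bypass flow is programmed, consider the following self-contained check (illustrative only, not IPOP/VIAS code; the addresses and prefix length are made-up examples):

```c
/*
 * Minimal sketch of the "are these endpoints on the same network?"
 * test that precedes programming an OpenFlow bypass flow.
 * Illustrative only: not code from VIAS or IPOP.
 */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

/* True if two IPv4 addresses fall in the same subnet for the given
 * prefix length, i.e. a direct (non-tunneled) path may exist.       */
static int same_subnet(const char *a, const char *b, int prefix)
{
    struct in_addr ia, ib;
    if (inet_pton(AF_INET, a, &ia) != 1 ||
        inet_pton(AF_INET, b, &ib) != 1)
        return 0;
    uint32_t mask = prefix ? htonl(~0u << (32 - prefix)) : 0;
    return (ia.s_addr & mask) == (ib.s_addr & mask);
}

int main(void)
{
    /* Same /24: a controller could install a bypass flow instead of
     * tunneling this pair's traffic through the overlay.             */
    printf("%d\n", same_subnet("10.0.1.7", "10.0.1.42", 24));  /* 1 */
    printf("%d\n", same_subnet("10.0.1.7", "10.0.2.42", 24));  /* 0 */
    return 0;
}
```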
{"title":"Self-configuring Software-defined Overlay Bypass for Seamless Inter- and Intra-cloud Virtual Networking","authors":"Kyu-Young Jeong, R. Figueiredo","doi":"10.1145/2907294.2907318","DOIUrl":"https://doi.org/10.1145/2907294.2907318","url":null,"abstract":"Many techniques have been proposed to provide, transparently, the abstraction of a layer-2 virtual network environment within a provider, e.g. by leveraging Software-Defined Networking (SDN). However, cloud providers often constrain layer-2 communication across instances; furthermore, SDN integration and layer-2 messaging between distinct domains distributed across the Internet is not possible, hindering the ability for tenants to deploy their virtual networks across providers. In contrast, overlay networks provide a flexible foundation for inter-cloud virtual private networking (VPN), by tunneling virtual network traffic through private, authenticated end-to-end overlay links. However, overlays inherently incur network virtualization overheads, including header encapsulation and user/kernel boundary crossing. This paper proposes a novel system -- VIAS (VIrtualization Acceleration over SDN) -- that delivers the flexibility of overlays for inter-cloud virtual private networking, while transparently applying SDN techniques (available in existing OpenFlow hardware or software switches) to selectively bypass overlay tunneling and achieve near-native performance for TCP/UDP flows within a provider. Architecturally, VIAS is unique in how it integrates SDN and overlay controllers in a distributed fashion to coordinate the management of virtual network links and flows. The approach is self-organizing, whereby overlay nodes can detect that peer endpoints are in the same network and program bypass flows between OpenFlow switches. While generally applicable, VIAS in particular applies to nested VMs/containers across cloud providers, supporting seamless communication within and across providers. VIAS has been implemented as an extension to an existing virtual network overlay platform (IP-over-P2P, IPOP) by integrating OpenFlow controller functionality with distributed overlay controllers. We evaluate the performance of VIAS in realistic cloud environments using an implementation based on IPOP, the RYU SDN framework, Open vSwitch, and LXC containers across various cloud environment including Amazon, Google compute engine, and CloudLab.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"55 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85405468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Session details: Cloud and Resource Management
Ming Zhao
{"title":"Session details: Cloud and Resource Management","authors":"Ming Zhao","doi":"10.1145/3257974","DOIUrl":"https://doi.org/10.1145/3257974","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"4 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91417381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SWAT: A Programmable, In-Memory, Distributed, High-Performance Computing Platform
M. Grossman, Vivek Sarkar
The field of data analytics is currently going through a renaissance as a result of ever-increasing dataset sizes, the value of the models that can be trained from those datasets, and a surge in flexible, distributed programming models. In particular, the Apache Hadoop and Spark programming systems, as well as their supporting projects (e.g. HDFS, SparkSQL), have greatly simplified the analysis and transformation of datasets whose size exceeds the capacity of a single machine. While these programming models facilitate the use of distributed systems to analyze large datasets, they have been plagued by performance issues. The I/O performance bottlenecks of Hadoop are partially responsible for the creation of Spark. Performance bottlenecks in Spark due to the JVM object model, garbage collection, interpreted/managed execution, and other abstraction layers are responsible for the creation of additional optimization layers, such as Project Tungsten. Indeed, the Project Tungsten issue tracker states that the "majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory". In this work, we address the CPU and memory performance bottlenecks that exist in Apache Spark by accelerating user-written computational kernels using accelerators. We refer to our approach as Spark With Accelerated Tasks (SWAT). SWAT is an accelerated data analytics (ADA) framework that enables programmers to natively execute Spark applications on high performance hardware platforms with co-processors, while continuing to write their applications in a JVM-based language like Java or Scala. Runtime code generation creates OpenCL kernels from JVM bytecode, which are then executed on OpenCL accelerators. In our work we emphasize 1) full compatibility with a modern, existing, and accepted data analytics platform, 2) an asynchronous, event-driven, and resource-aware runtime, 3) multi-GPU memory management and caching, and 4) ease-of-use and programmability. Our performance evaluation demonstrates up to 3.24x overall application speedup relative to Spark across six machine learning benchmarks, with a detailed investigation of these performance improvements.
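SWAT's bytecode-to-OpenCL generator is not shown in the abstract; to make concrete what "OpenCL kernels executed on OpenCL accelerators" involves, here is a minimal, self-contained OpenCL host program in C that runs the kind of element-wise map kernel a Spark transformation might be compiled into. The kernel body and problem size are illustrative assumptions, and error handling is omitted for brevity.

```c
/* Minimal OpenCL host program (illustrative only, not SWAT's generated
 * code): runs an element-wise map kernel on the default device.
 * Build: cc map.c -lOpenCL                                            */
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void map_scale(__global const float *in,\n"
    "                        __global float *out) {\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = 2.0f * in[i] + 1.0f;\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = (float)i;

    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Compile the kernel source at runtime, as a JIT code path would. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "map_scale", NULL);

    cl_mem din  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 sizeof in, in, NULL);
    cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof out,
                                 NULL, NULL);
    clSetKernelArg(k, 0, sizeof din,  &din);
    clSetKernelArg(k, 1, sizeof dout, &dout);

    size_t gsize = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dout, CL_TRUE, 0, sizeof out, out, 0, NULL, NULL);

    printf("out[3] = %f\n", out[3]);   /* expect 7.0 */

    clReleaseMemObject(din); clReleaseMemObject(dout);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```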
{"title":"SWAT: A Programmable, In-Memory, Distributed, High-Performance Computing Platform","authors":"M. Grossman, Vivek Sarkar","doi":"10.1145/2907294.2907307","DOIUrl":"https://doi.org/10.1145/2907294.2907307","url":null,"abstract":"The field of data analytics is currently going through a renaissance as a result of ever-increasing dataset sizes, the value of the models that can be trained from those datasets, and a surge in flexible, distributed programming models. In particular, the Apache Hadoop and Spark programming systems, as well as their supporting projects (e.g. HDFS, SparkSQL), have greatly simplified the analysis and transformation of datasets whose size exceeds the capacity of a single machine. While these programming models facilitate the use of distributed systems to analyze large datasets, they have been plagued by performance issues. The I/O performance bottlenecks of Hadoop are partially responsible for the creation of Spark. Performance bottlenecks in Spark due to the JVM object model, garbage collection, interpreted/managed execution, and other abstraction layers are responsible for the creation of additional optimization layers, such as Project Tungsten. Indeed, the Project Tungsten issue tracker states that the \"majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory\". In this work, we address the CPU and memory performance bottlenecks that exist in Apache Spark by accelerating user-written computational kernels using accelerators. We refer to our approach as Spark With Accelerated Tasks (SWAT). SWAT is an accelerated data analytics (ADA) framework that enables programmers to natively execute Spark applications on high performance hardware platforms with co-processors, while continuing to write their applications in a JVM-based language like Java or Scala. Runtime code generation creates OpenCL kernels from JVM bytecode, which are then executed on OpenCL accelerators. In our work we emphasize 1) full compatibility with a modern, existing, and accepted data analytics platform, 2) an asynchronous, event-driven, and resource-aware runtime, 3) multi-GPU memory management and caching, and 4) ease-of-use and programmability. Our performance evaluation demonstrates up to 3.24x overall application speedup relative to Spark across six machine learning benchmarks, with a detailed investigation of these performance improvements.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75880588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Session details: Keynote Address
Jack Lange
{"title":"Session details: Keynote Address","authors":"Jack Lange","doi":"10.1145/3257968","DOIUrl":"https://doi.org/10.1145/3257968","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76229097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism
Jungwon Kim, Seyong Lee, J. Vetter
We propose IMPACC, an MPI+OpenACC framework for heterogeneous accelerator clusters. IMPACC tightly integrates MPI and OpenACC while exploiting the shared memory parallelism in the target system. IMPACC dynamically adapts input MPI+OpenACC applications to the target heterogeneous accelerator clusters to fully exploit target system-specific features. IMPACC provides programmers with a unified virtual address space, automatic NUMA-friendly task-device mapping, efficient integrated communication routines, seamless streamlining of asynchronous executions, and transparent memory sharing. We have implemented IMPACC and evaluated its performance on three heterogeneous accelerator systems, including the Titan supercomputer. Results show that IMPACC can achieve easier programming, higher performance, and better scalability than the current MPI+OpenACC model.
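IMPACC's internals are not shown in the abstract; the sketch below is a minimal example of the MPI+OpenACC input style such a framework consumes, one accelerator-offloaded local loop per rank plus an MPI reduction across the cluster. It is not IMPACC's own code, and the build command assumes an OpenACC-capable MPI compiler wrapper such as the NVIDIA HPC SDK's mpicc.

```c
/*
 * Minimal MPI+OpenACC example of the input programming model a framework
 * like IMPACC consumes (a sketch of the style, not IMPACC's runtime).
 * Build (assumed toolchain): mpicc -acc daxpy_mpi.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long n = 1 << 20;                   /* local elements per rank */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Offload this rank's local DAXPY to its accelerator. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (long i = 0; i < n; i++)
        y[i] = 2.0 * x[i] + y[i];

    /* Combine per-rank partial results across the cluster with MPI. */
    double local = y[0], global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of y[0] over %d ranks = %f\n", size, global);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}
```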
{"title":"IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism","authors":"Jungwon Kim, Seyong Lee, J. Vetter","doi":"10.1145/2907294.2907302","DOIUrl":"https://doi.org/10.1145/2907294.2907302","url":null,"abstract":"We propose IMPACC, an MPI+OpenACC framework for heterogeneous accelerator clusters. IMPACC tightly integrates MPI and OpenACC, while exploiting the shared memory parallelism in the target system. IMPACC dynamically adapts the input MPI+OpenACC applications on the target heterogeneous accelerator clusters to fully exploit target system-specific features. IMPACC provides the programmers with the unified virtual address space, automatic NUMA-friendly task-device mapping, efficient integrated communication routines, seamless streamlining of asynchronous executions, and transparent memory sharing. We have implemented IMPACC and evaluated its performance using three heterogeneous accelerator systems, including Titan supercomputer. Results show that IMPACC can achieve easier programming, higher performance, and better scalability than the current MPI+OpenACC model.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"89 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80388217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8