
2014 21st International Conference on High Performance Computing (HiPC): Latest Publications

Cache-conscious scheduling of streaming pipelines on parallel machines with private caches
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116893
Kunal Agrawal, Jordyn C. Maglalang, Jeremy T. Fineman
This paper studies the problem of scheduling a streaming pipeline on a multicore machine with private caches to maximize throughput. The theoretical contribution includes lower and upper bounds in the parallel external-memory model. We show that a simple greedy scheduling strategy is asymptotically optimal with a constant-factor memory augmentation. More specifically, we show that if our strategy has a running time of Q cache misses on a machine with size-M caches, then every “static” scheduling policy must have a running time of at least Ω(Q) cache misses on a machine with size-M/6 caches. Our experimental study considers the question of whether scheduling based on cache effects is more important than scheduling based on only the number of computation steps. Using synthetic pipelines with a range of parameters, we compare our cache-based partitioning against several other static schedulers that load-balance computation. In most cases, the cache-based partitioning indeed beats the other schedulers, but there are some cases that go the other way. We conclude that considering cache effects is a good idea, but other features of the streaming pipeline are also important.
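The greedy strategy can be pictured with a small sketch. The code below is a hypothetical simplification (my own illustration, not the paper's algorithm): pipeline stages are walked in order and a new core's partition is started whenever the accumulated stage state would overflow a size-M private cache.

```python
def greedy_partition(stage_footprints, cache_size_m):
    """Greedily group consecutive pipeline stages so that each group's
    combined working-set footprint fits in one size-M private cache.

    stage_footprints: per-stage state sizes in bytes
    cache_size_m: private cache capacity per core in bytes
    Returns a list of groups, each a list of stage indices for one core.
    """
    groups, current, used = [], [], 0
    for stage, footprint in enumerate(stage_footprints):
        if current and used + footprint > cache_size_m:
            groups.append(current)      # close this core's group
            current, used = [], 0
        current.append(stage)
        used += footprint
    if current:
        groups.append(current)
    return groups

# Example: six stages mapped onto cores with 32 KiB private caches.
print(greedy_partition([10_000, 18_000, 8_000, 30_000, 4_000, 12_000], 32_768))
```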
Citations: 3
CQA: A code quality analyzer tool at binary level
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116904
Andres Charif Rubial, Emmanuel Oseret, Jose Noudohouenou, W. Jalby, G. Lartigue
Most of today's performance analysis tools are focused on issues occurring at the multi-core and communication level. However, there are several reasons why an application may not behave correctly in terms of performance at the core level. For a significant part, loops in industrial applications are limited by the quality of the code generated by the compiler and do not always fully benefit from the available computing power of recent processors. For instance, when the compiler is not able to vectorize loops, up to an 8x factor can be lost. It is essential to first validate the core-level performance before focusing on higher-level issues. This paper presents the CQA tool, a loop-centric code quality analyzer based on a simplified unicore architecture performance model and on quality metrics. The tool analyzes the quality of the code generated by the compiler. It provides high-level metrics along with human-understandable reports that relate to the source code. Our performance model assumes that all data are resident in the first-level cache. It identifies architectural bottlenecks and gives an estimation of the number of cycles spent in each iteration of a given innermost loop. Our modeling and analyses are done statically and require no execution or recompilation of the application. We show practical examples of situations where our tool is able to provide very valuable information leading to a performance gain.
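To give a flavor of the kind of static, L1-resident estimate such a tool can report, here is a toy cost model (a hypothetical sketch, not CQA's actual model): the cycles per innermost-loop iteration are bounded by the most saturated execution resource.

```python
def cycles_per_iteration(instruction_mix, resource_throughput):
    """Toy static estimate of cycles per innermost-loop iteration,
    assuming all operands hit in the first-level cache.

    instruction_mix: dict resource name -> micro-ops issued per iteration
    resource_throughput: dict resource name -> micro-ops retired per cycle
    The loop is bound by its most heavily used resource.
    """
    return max(uops / resource_throughput[res]
               for res, uops in instruction_mix.items())

# Hypothetical loop body: 6 loads, 2 FP adds, 2 FP multiplies, 1 store.
mix = {"load": 6, "fp_add": 2, "fp_mul": 2, "store": 1}
throughput = {"load": 2, "fp_add": 1, "fp_mul": 1, "store": 1}
print(cycles_per_iteration(mix, throughput))  # 3.0 cycles, load-port bound
```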
Citations: 33
Performance evaluation of multi core systems for high throughput medical applications involving model predictive control
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116884
Madhurima Pore, Ayan Banerjee, S. Gupta
Many medical control devices used for critical patients have model predictive controllers (MPCs). An MPC estimates the drug level in parts of the patient's body based on a human physiology model, either to alert the medical authority or to change the drug infusion rate. This model prediction has to be completed before the drug infusion rate is changed, i.e., every few seconds. Instead of mathematical models like pharmacokinetic models, more accurate models such as spatio-temporal drug diffusion can be used to improve the prediction and prevention of drug overshoot and undershoot. However, these models require the high computational capability of platforms such as recent many-core GPUs, the Intel Xeon Phi (MIC), or the Intel Core i7. This work explores thread-level and data-level parallelism, as well as computation versus communication times, of such model predictive applications used for multi-patient monitoring in hospital data centers, exploiting many-core platforms to maximize throughput (i.e., the number of patients monitored simultaneously). We also study the energy and performance of these applications to evaluate architecture suitability. We show that, given a set of MPC applications, mapping them onto heterogeneous platforms can give performance improvements and energy savings.
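To make the kind of prediction step concrete, the sketch below (my own simplification, not the paper's physiological model) performs one explicit finite-difference update of 1-D drug diffusion; each interior point updates independently, which is the data-level parallelism that many-core platforms exploit.

```python
def diffuse_step(conc, d_coeff, dx, dt):
    """One explicit finite-difference step of 1-D diffusion:
    c_i(t+dt) = c_i + D*dt/dx^2 * (c_{i-1} - 2*c_i + c_{i+1}).
    Boundary values are held fixed here for simplicity. Every interior
    point updates independently, so the loop is data parallel."""
    alpha = d_coeff * dt / (dx * dx)     # must stay <= 0.5 for stability
    nxt = conc[:]
    for i in range(1, len(conc) - 1):
        nxt[i] = conc[i] + alpha * (conc[i - 1] - 2 * conc[i] + conc[i + 1])
    return nxt

# Predict one control interval ahead from an initial bolus mid-domain.
conc = [0.0] * 50
conc[25] = 1.0
for _ in range(100):                     # 100 time steps per control interval
    conc = diffuse_step(conc, d_coeff=1e-3, dx=0.1, dt=0.5)
```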
Citations: 3
Balancing context switch penalty and response time with elastic time slicing
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116707
Nagakishore Jammula, Moinuddin K. Qureshi, Ada Gavrilovska, Jongman Kim
Virtualization allows the platform to have an increased number of logical processors by multiplexing the underlying resources across different virtual machines. The hardware resources get time-shared not only between different virtual machines, but also between different workloads of the same virtual machine. An important source of performance degradation in such a scenario comes from the cache warmup penalties a workload experiences when it gets scheduled, as the working set belonging to the workload gets displaced by other concurrently running workloads. We show that a virtual machine that time-switches between four workloads can slow some of the workloads down by as much as 54%. However, such performance degradation depends on the workload behavior, with some workloads experiencing negligible degradation and some severe degradation. We propose Elastic Time Slicing (ETS) to reduce the context switch overhead for the most affected workloads. We demonstrate that by taking the workload-specific context switch overhead into consideration, the CPU scheduler can make better decisions to minimize the context switch penalty for the most affected workloads, thereby resulting in substantial performance improvements. ETS enhances performance without compromising on response time, thereby achieving dual benefits. To facilitate ETS, we develop a low-overhead hardware-based mechanism that dynamically estimates the sensitivity of a given workload to context switching. We evaluate the accuracy of the mechanism under various cache management policies and show that it is very reliable. Context-switch-related warmup penalties increase as optimizations are applied to address traditional cache misses. For the first time, we assess the impact of advanced replacement policies and establish that it is significant.
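A minimal sketch of the scheduling idea follows (hypothetical; the actual ETS design couples this with hardware-based sensitivity estimation): each workload's time slice grows with its estimated cache warmup penalty, and slices are rescaled so that one scheduling round still fits the response-time budget.

```python
def elastic_slices(warmup_penalty_us, base_slice_us=1000, round_budget_us=4000):
    """Assign per-workload time slices that grow with the estimated
    context-switch (cache warmup) penalty, then rescale them so that one
    full round of all workloads fits within the response-time budget.

    warmup_penalty_us: dict workload -> estimated warmup cost in microseconds
    Returns dict workload -> time slice in microseconds."""
    raw = {w: base_slice_us + penalty for w, penalty in warmup_penalty_us.items()}
    scale = round_budget_us / sum(raw.values())
    return {w: s * scale for w, s in raw.items()}

# Four co-scheduled workloads; 'db' is the most switch-sensitive one.
print(elastic_slices({"db": 800, "web": 100, "batch": 50, "idle": 10}))
```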
Citations: 7
High performance MPI library over SR-IOV enabled InfiniBand clusters
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116876
Jie Zhang, Xiaoyi Lu, Jithin Jose, Mingzhe Li, Rong Shi, D. Panda
Virtualization has come to play a central role in HPC clouds due to easy management and the low cost of computation and communication. Recently, Single Root I/O Virtualization (SR-IOV) technology has been introduced for high-performance interconnects such as InfiniBand and can attain near-native performance for inter-node communication. However, the SR-IOV scheme lacks locality-aware communication support, which leads to performance overheads for inter-VM communication within the same physical node. To address this issue, this paper first proposes a high performance design of an MPI library over SR-IOV enabled InfiniBand clusters by dynamically detecting VM locality and coordinating data movements between SR-IOV and Inter-VM shared memory (IVShmem) channels. Through our proposed design, MPI applications running in virtualized mode can achieve efficient locality-aware communication on SR-IOV enabled InfiniBand clusters. In addition, we optimize communications in the IVShmem and SR-IOV channels by analyzing the performance impact of core mechanisms and parameters inside the MPI library to deliver better performance in virtual machines. Finally, we conduct comprehensive performance studies using point-to-point and collective benchmarks, and HPC applications. Experimental evaluations show that our proposed MPI library design can significantly improve the performance of point-to-point and collective operations, and of MPI applications with different InfiniBand transport protocols (RC and UD), by up to 158%, 76%, and 43%, respectively, compared with SR-IOV. To the best of our knowledge, this is the first study to offer a high performance MPI library that supports efficient locality-aware MPI communication over SR-IOV enabled InfiniBand clusters.
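The locality test at the heart of the design can be sketched as below (a hypothetical illustration, not MVAPICH2 code): if the peer process's VM runs on the same physical host, traffic goes over the IVShmem channel, otherwise over the SR-IOV virtual function.

```python
def select_channel(my_host, peer_host):
    """Pick the channel for a message to a peer MPI process. VMs on the
    same physical host use the inter-VM shared-memory (IVShmem) path;
    remote peers use the SR-IOV virtual function over InfiniBand."""
    return "ivshmem" if my_host == peer_host else "sriov"

# Physical-host IDs would be exchanged once during MPI job startup.
rank_to_host = {0: "nodeA", 1: "nodeA", 2: "nodeB"}
print(select_channel(rank_to_host[0], rank_to_host[1]))  # ivshmem
print(select_channel(rank_to_host[0], rank_to_host[2]))  # sriov
```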
Citations: 28
Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116873
Rong Shi, S. Potluri, Khaled Hamidouche, Jonathan L. Perkins, Mingzhe Li, D. Rossetti, D. Panda
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-to-GPU inter-node communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. Besides that, the newly introduced GPUDirect RDMA (GDR) is a promising solution to further address this data movement bottleneck. However, the existing design in MPI libraries applies the rendezvous protocol for all message sizes, which incurs considerable overhead for small message communications due to extra synchronization message exchange. In this paper, we propose new techniques to optimize inter-node GPU-to-GPU communication for small message sizes. Our designs to support the eager protocol include efficient support at both sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication for small message sizes using the eager protocol. Our experimental results demonstrate up to 59% and 63% reductions in latency for GPU-to-GPU and CPU-to-GPU point-to-point communication, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed design with two end applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to a 23.4% latency reduction for GPULBM and a 58% increase in average TPS for HOOMD-blue, respectively.
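The eager/rendezvous split can be sketched as follows (a hypothetical simplification of the protocol choice, not the library's implementation): small GPU-resident messages are sent immediately through a pre-registered staging buffer, while large messages keep the handshake-based rendezvous path.

```python
EAGER_THRESHOLD = 8 * 1024  # bytes; an illustrative tuning knob, not a real default

def choose_protocol(payload_size):
    """Select the transfer protocol for a GPU-resident message.
    Eager: copy into a pre-registered staging buffer and send at once,
    avoiding the request/reply synchronization round-trip.
    Rendezvous: exchange control messages first, then move the bulk data,
    which amortizes the handshake over a large transfer."""
    return "eager" if payload_size <= EAGER_THRESHOLD else "rendezvous"

for size in (512, 4096, 65536, 1 << 20):
    print(size, "->", choose_protocol(size))
```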
Citations: 31
Relax-Miracle: GPU parallelization of semi-analytic Fourier-domain solvers for earthquake modeling
Pub Date : 2014-12-01 DOI: 10.1109/HIPC.2014.7116901
S. Masuti, S. Barbot, Nachiket Kapre
Effective utilization of GPU processing capacity for scientific workloads is often limited by memory throughput and PCIe communication transfer times. This is particularly true for semi-analytic Fourier-domain computations in earthquake modeling (Relax) where operations on large-scale 3D data structures can require moving large volumes of data from storage to the compute in predictable but orthogonal access patterns. We show how to transform the computation to avoid PCIe transfers entirely by reconstructing the 3D data structures directly within the GPU global memory. We also consider arithmetic transformations that replace some communication-intensive 1D FFTs with simpler, data-parallel analytical solutions. Using our approach we are able to reduce computation times for a geophysical model of the 2012 Mw8.7 Wharton Basin earthquake from 2 hours down to 15 minutes (speedup of ≈8x) for grid sizes of 512×512×256 when comparing NVIDIA K20 with a 16-threaded Intel Xeon E5-2670 CPU (supported by Intel-MKL libraries). Our GPU-accelerated solution (called Relax-Miracle) also makes it possible to conduct Markov-Chain Monte-Carlo simulations using more than 1000 time-dependent models on 12 GPUs per single day of calculation, enhancing our ability to use such techniques for time-consuming data inversion and Bayesian inversion experiments.
Citations: 2
Online failure prediction for HPC resources using decentralized clustering
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116903
Alejandro Pelaez, Andres Quiroz, J. Browne, Edward Chuah, M. Parashar
Ensuring high reliability of large-scale clusters is becoming more critical as the size of these machines continues to grow, since this increases the complexity and amount of interactions between different nodes and thus results in a high failure frequency. For this reason, predicting node failures in order to prevent errors from happening in the first place has become extremely valuable. A common approach for failure prediction is to analyze traces of system events to find correlations between event types or anomalous event patterns and node failures, and to use the types or patterns identified as failure predictors at run-time. However, typical centralized solutions for failure prediction in this manner suffer from high transmission and processing overheads at very large scales. We present a solution to the problem of predicting compute node soft-lockups in large scale clusters by using a decentralized online clustering algorithm (DOC) to detect anomalies in resource usage logs, which have been shown to correlate with particular types of node failures in supercomputer clusters. We demonstrate the effectiveness of this system by using the monitoring logs from the Ranger supercomputer at Texas Advanced Computing Center. Experiments show that this approach can achieve similar accuracy to other related approaches, while maintaining low RAM and bandwidth usage, with a runtime impact on currently running applications of less than 2%.
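The anomaly test itself can be pictured with a small sketch (my own simplification; DOC maintains its clusters in a decentralized, incremental fashion): a node's resource-usage sample is flagged as a potential failure precursor when it lies far from every known cluster centroid.

```python
import math

def is_anomalous(sample, centroids, threshold):
    """Flag a resource-usage sample (e.g. [cpu%, mem%, io_wait%]) as anomalous
    when its distance to the nearest known cluster centroid exceeds the
    threshold. The centroids stand in for the clusters that the decentralized
    clustering layer would maintain incrementally."""
    nearest = min(math.dist(sample, c) for c in centroids)
    return nearest > threshold

centroids = [[20.0, 35.0, 1.0], [75.0, 60.0, 3.0]]   # two "normal" operating modes
print(is_anomalous([22.0, 37.0, 1.5], centroids, threshold=15.0))   # False
print(is_anomalous([98.0, 95.0, 40.0], centroids, threshold=15.0))  # True
```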
Citations: 16
Optical overlay NUCA: A high speed substrate for shared L2 caches
E. Peter, A. Arora, Akriti Bagaria, S. Sarangi
In this paper, we propose to use optical NOCs to design cache access protocols for large shared L2 caches. We observe that the problem is unique because optical networks have very low latency, and in principle all the cache banks are very close to each other. A naive approach is to broadcast a request to a set of banks that might possibly contain a copy of a block. However, this approach is wasteful in terms of energy and bandwidth. Hence, we propose a novel scheme in this paper, TSI, which creates a set of virtual networks (overlays) of cache banks over a physical optical NOC. We search for a block inside each overlay using a combination of multicast and unicast messages. We additionally create support for our overlay networks by proposing optimizations to the previously proposed R-SWMR network. We also propose a set of novel hardware structures for creating and managing overlays, and for efficiently locating blocks in the overlay. The performance of the TSI scheme is within 2-3% of a broadcast scheme, and it is faster than traditional static NUCA schemes by 50%. As compared to the broadcast scheme, it reduces the number of accesses, and consequently the dynamic energy, by 20-30%.
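A sketch of the lookup path (hypothetical; the paper's TSI scheme adds dedicated hardware structures and an optimized R-SWMR network): a block address selects one overlay, the request is multicast to the banks in that overlay, and a hit is answered with a unicast reply.

```python
def overlay_lookup(block_addr, overlays, bank_contents):
    """Search for a cache block using an overlay-scoped multicast.
    overlays: list of bank-ID lists, one virtual network per entry
    bank_contents: dict bank_id -> set of block addresses currently held
    Returns the bank that would send the unicast reply, or None on a miss."""
    overlay = overlays[block_addr % len(overlays)]   # home overlay for this block
    for bank in overlay:                             # multicast probe
        if block_addr in bank_contents[bank]:
            return bank                              # unicast reply source
    return None                                      # miss: fetch from memory

overlays = [[0, 1, 2, 3], [4, 5, 6, 7]]
bank_contents = {b: set() for b in range(8)}
bank_contents[5].add(0xBEEF)                         # block homed in overlay 1
print(overlay_lookup(0xBEEF, overlays, bank_contents))  # 5
```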
Citations: 7
Particle advection performance over varied architectures and workloads
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116900
H. Childs, Scott Biersdorff, David Poliakoff, David Camp, A. Malony
Particle advection is a foundational operation for many flow visualization techniques, including streamlines, Finite-Time Lyapunov Exponent (FTLE) calculation, and stream surfaces. The workload for particle advection problems varies greatly, including significant variation in computational requirements. With this study, we consider the performance impacts of hardware architecture on this problem, studying distributed-memory systems with CPUs with varying numbers of cores per node, and with nodes with one to three GPUs. Our goal was to explore which architectures were best suited to which workloads, and why. While the results of this study will help inform visualization scientists which architectures they should use when solving certain flow visualization problems, it is also informative for the larger HPC community, since many simulation codes will soon incorporate visualization via in situ techniques.
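Since the per-particle integration is the kernel whose cost varies so widely, a minimal sketch of the operation helps frame the study (a generic forward-Euler step, not the specific integrator benchmarked in the paper): each particle repeatedly samples the velocity field at its current position and steps forward.

```python
def advect(pos, velocity_at, dt, steps):
    """Advance one particle through a steady 2-D velocity field with forward
    Euler. The number of steps a particle takes before it leaves the field or
    terminates is one source of the workload variation discussed above."""
    x, y = pos
    for _ in range(steps):
        vx, vy = velocity_at(x, y)       # sample the velocity field
        x, y = x + dt * vx, y + dt * vy  # step the particle forward
    return x, y

def rotation(x, y):
    """Simple rotational field: particles orbit the origin."""
    return -y, x

print(advect((1.0, 0.0), rotation, dt=0.01, steps=100))
```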
Citations: 9