
Latest publications from the 2014 IEEE 28th International Parallel and Distributed Processing Symposium

Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations
Pub Date: 2014-05-19 DOI: 10.1109/IPDPS.2014.125
H. Aktulga, A. Buluç, Samuel Williams, Chao Yang
Obtaining highly accurate predictions of the properties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal eigenpairs of the many-body nuclear Hamiltonian matrix. In the Many-body Fermion Dynamics for nuclei (MFDn) code, a block eigensolver is used for this purpose. Due to the large size of the sparse matrices involved, a significant fraction of the time spent on the eigenvalue computations is associated with the multiplication of a sparse matrix (and the transpose of that matrix) with multiple vectors (SpMM and SpMM_T). Existing implementations of SpMM and SpMM_T significantly underperform expectations. Thus, in this paper, we present and analyze optimized implementations of SpMM and SpMM_T. We base our implementation on the compressed sparse blocks (CSB) matrix format and target systems with multi-core architectures. We develop a performance model that allows us to understand and estimate the performance characteristics of our SpMM kernel implementations, and demonstrate the efficiency of our implementation on a series of real-world matrices extracted from MFDn. In particular, we obtain a 3-4× speedup on the requisite operations over good implementations based on the commonly used compressed sparse row (CSR) matrix format. The improvements in the SpMM kernel suggest we may attain roughly a 40% speedup in the overall execution time of the block eigensolver used in MFDn.
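The core SpMM operation the paper optimizes can be illustrated with a minimal sequential sketch. This uses the simpler CSR format in pure Python for clarity; the paper's optimized kernels use the CSB format and multi-core parallelism, which are not shown here.

```python
def csr_spmm(indptr, indices, data, X):
    """Multiply a CSR-format sparse matrix by a dense block of k vectors.

    X and the result Y are lists of rows. Each nonzero data[p] is loaded
    once and applied to all k vectors, which is the bandwidth advantage
    of SpMM over k independent sparse matrix-vector products (SpMV).
    """
    n = len(indptr) - 1          # number of rows
    k = len(X[0])                # number of right-hand-side vectors
    Y = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            a, j = data[p], indices[p]
            for c in range(k):   # reuse nonzero a across all k vectors
                Y[i][c] += a * X[j][c]
    return Y
```

Multiplying by a block of k vectors in a single pass reads each nonzero once instead of k times; that amortization of matrix traffic is what makes SpMM attractive in a block eigensolver.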
Citations: 75
Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters
Pub Date: 2014-05-19 DOI: 10.1109/IPDPS.2014.113
K. Vaidyanathan, K. Pamnany, Dhiraj D. Kalamkar, A. Heinecke, M. Smelyanskiy, Jongsoo Park, Daehyun Kim, Aniruddha G. Shet, Bharat Kaul, B. Joó, P. Dubey
Intel Xeon Phi coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real-world applications are significantly impacted by network characteristics, and to maximize the performance of such applications on these clusters, it is particularly important to effectively saturate network bandwidth and/or hide communication latency. We demonstrate how to do so using techniques such as pipelined DMAs for data transfer, dynamic chunk sizing, and better asynchronous progress. We also show a method for, and the impact of, avoiding serialization and maximizing parallelism during application communication phases. Additionally, we apply application optimizations focused on balancing computation and communication in order to hide communication latency and improve utilization of cores and of network bandwidth. We demonstrate the impact of our techniques on three well-known and highly optimized HPC kernels running natively on the Intel Xeon Phi coprocessor. For the Wilson-Dslash operator from Lattice QCD, we characterize the improvements from each of our optimizations for communication performance, apply our method for maximizing concurrency during communication phases, and show an overall 48% improvement over our previously best published result. For HPL/LINPACK, we show 68.5% efficiency with 97 TFLOPs on 128 Intel Xeon Phi coprocessors, the first reported native HPL efficiency on a coprocessor-based supercomputer. For FFT, we show 10.8 TFLOPs using 1024 Intel Xeon Phi coprocessors on the TACC Stampede cluster, the highest reported performance on any Intel Architecture-based cluster and the first such result to be reported on a coprocessor-based supercomputer.
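The pipelining idea above, overlapping the packing of one chunk with the transfer of the previous one, can be sketched in a few lines. This is a toy Python illustration under stated assumptions: a thread and a bounded queue stand in for DMA engines, and `send` is a placeholder for the actual network transfer, not any real Xeon Phi API.

```python
import queue
import threading

def pipelined_send(data, chunk_size, send):
    """Stream `data` in chunks, overlapping the packing of chunk i
    with the transmission of chunk i-1 (double buffering through a
    bounded queue; a toy stand-in for pipelined DMA transfers)."""
    q = queue.Queue(maxsize=2)  # at most two chunks in flight

    def sender():
        while True:
            chunk = q.get()
            if chunk is None:   # sentinel: no more chunks
                return
            send(chunk)

    t = threading.Thread(target=sender)
    t.start()
    for off in range(0, len(data), chunk_size):
        # "pack" the next chunk while the sender drains earlier ones
        q.put(bytes(data[off:off + chunk_size]))
    q.put(None)
    t.join()
```

The bounded queue is the key design point: it caps the memory held by in-flight chunks while still letting packing and sending proceed concurrently, and the chunk size is the knob the paper tunes dynamically.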
Citations: 17
CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination
Pub Date: 2014-05-19 DOI: 10.1109/IPDPS.2014.27
Matthieu Dorier, Gabriel Antoniu, R. Ross, D. Kimpe, Shadi Ibrahim
Unmatched computation and storage performance in new HPC systems has led to a plethora of I/O optimizations, ranging from application-side collective I/O to network and disk-level request scheduling on the file system side. As we deal with ever larger machines, the interference produced by multiple applications accessing a shared parallel file system in a concurrent manner becomes a major problem. Interference often breaks single-application I/O optimizations, dramatically degrading application I/O performance and, as a result, lowering machine-wide efficiency. This paper focuses on CALCioM, a framework that aims to mitigate I/O interference through the dynamic selection of appropriate scheduling policies. CALCioM allows several applications running on a supercomputer to communicate and coordinate their I/O strategy in order to avoid interfering with one another. In this work, we examine four I/O strategies that can be accommodated in this framework: serializing, interrupting, interfering, and coordinating. Experiments on Argonne's BG/P Surveyor machine and on several clusters of the French Grid'5000 show how CALCioM can be used to efficiently and transparently improve the scheduling strategy between two otherwise interfering applications, given specified metrics of machine-wide efficiency.
Citations: 110
Active Measurement of the Impact of Network Switch Utilization on Application Performance
Pub Date: 2014-05-19 DOI: 10.1109/IPDPS.2014.28
Marc Casas, G. Bronevetsky
Inter-node networks are a key capability of High-Performance Computing (HPC) systems that differentiates them from less capable classes of machines. However, in spite of their very high performance, the increasing computational power of HPC compute nodes and the associated rise in application communication needs make network performance a common performance bottleneck. To achieve high performance in spite of network limitations, application developers require tools to measure their applications' network utilization and inform them about how the network's communication capacity relates to the performance of their applications. This paper presents a new performance measurement and analysis methodology based on empirical measurements of network behavior. Our approach uses two benchmarks that inject extra network communication. The first probes the fraction of the network that is utilized by a software component (an application or an individual task) to determine the existence and severity of network contention. The second aggressively injects network traffic while a software component runs to evaluate its performance on less capable networks or when it shares the network with other software components. We then combine the information from the two types of experiments to predict the performance slowdown experienced by multiple software components (e.g. multiple processes of a single MPI application) when they share a single network. Our methodology is applied to individual network switches and demonstrated by taking 6 representative HPC applications and predicting the performance slowdowns of the 36 possible application pairs. The average error of our predictions is less than 10%.
Citations: 10
Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths
Pub Date: 2014-05-19 DOI: 10.1109/IPDPS.2014.45
A. Davidson, S. Baxter, M. Garland, John Douglas Owens
Finding the shortest paths from a single source to all other vertices is a fundamental method used in a variety of higher-level graph algorithms. We present three parallel-friendly and work-efficient methods to solve this Single-Source Shortest Paths (SSSP) problem: Workfront Sweep, Near-Far, and Bucketing. These methods choose different approaches to balance the trade-off between saving work and organizational overhead. In practice, all of these methods do much less work than traditional Bellman-Ford methods, while adding only a modest amount of extra work over serial methods. These methods are designed to have a sufficient parallel workload to fill modern massively-parallel machines, and select reorganizational schemes that map well to these architectures. We show that in general our Near-Far method has the highest performance on modern GPUs, outperforming other parallel methods. We also explore a variety of parallel load-balanced graph traversal strategies and apply them to our SSSP solver. Our work-saving methods always outperform a traditional GPU Bellman-Ford implementation, achieving rates up to 14x higher on low-degree graphs and 340x higher on scale-free graphs. We also see significant speedups (20-60x) when compared against a serial implementation on graphs with adequately high degree.
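The Near-Far idea can be emulated sequentially in a short sketch: vertices whose tentative distance falls below the current split point form the "near" set and are relaxed eagerly, while the rest wait in the "far" pile until the split point advances by delta. This is a hypothetical Python illustration of that splitting rule, not the paper's GPU implementation, which organizes these sets as parallel work queues.

```python
import math

def near_far_sssp(adj, source, delta):
    """Single-source shortest paths with a Near-Far split.

    adj: {vertex: [(neighbor, weight), ...]}, every vertex present as a key;
    weights are assumed nonnegative. delta controls how far the split
    point advances each round.
    """
    dist = {u: math.inf for u in adj}
    dist[source] = 0.0
    split = delta
    near, far = {source}, set()
    while near or far:
        while near:              # drain the near set, Bellman-Ford style
            u = near.pop()
            for v, w in adj[u]:
                nd = dist[u] + w
                if nd < dist[v]:
                    dist[v] = nd
                    (near if nd < split else far).add(v)
        split += delta           # advance the split, promote far vertices
        near = {v for v in far if dist[v] < split}
        far -= near
    return dist
```

Deferring far vertices is what saves work relative to Bellman-Ford: a vertex is rarely relaxed before its near-final distance is known, so far fewer edges are re-processed.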
Citations: 176
A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU
Pub Date: 2014-05-19 DOI: 10.1109/IPDPS.2014.103
Tingxing Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, J. Dongarra
Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have much better performance per watt. In this paper we discuss efforts to redesign the most computation-intensive parts of BLAST, an application that solves the equations of compressible hydrodynamics with high order finite elements, using GPUs [BLAST, Dobrev]. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy; they substantially exceed the vendor's library routines in performance. We propose an auto-tuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU-accelerated BLAST obtained a 2.5× overall speedup and 1.42× energy efficiency (green up) using 4th order (Q4) finite elements, and a 1.9× speedup and 1.27× green up using 2nd order (Q2) finite elements.
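The auto-tuning step can be sketched generically: time each candidate launch configuration and keep the fastest. This hypothetical Python sketch abstracts away the CUDA specifics; `run` stands in for launching and timing a kernel variant, and the parameter names are illustrative only.

```python
import itertools

def autotune(run, param_grid):
    """Exhaustively evaluate a kernel over a grid of launch parameters
    and keep the fastest configuration. `run(**cfg)` must return the
    elapsed time for that configuration."""
    best_cfg, best_t = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        cfg = dict(zip(param_grid.keys(), values))
        t = run(**cfg)           # e.g. launch and time one kernel variant
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t
```

In the paper's setting, the grid would be searched once per finite-element order, since the best launch configuration shifts as the per-element work grows with the order.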
Citations: 60
A Multi-core Parallel Branch-and-Bound Algorithm Using Factorial Number System
Pub Date: 2014-05-19 DOI: 10.1109/IPDPS.2014.124
M. Mezmaz, Rudi Leroy, N. Melab, D. Tuyttens
Many real-world problems in different industrial and economic fields are permutation combinatorial optimization problems. Solving large instances of these problems to optimality, such as the flowshop problem, is a challenge for multi-core computing. This paper proposes a multi-threaded factoradic-based branch-and-bound algorithm to solve permutation combinatorial problems on multi-core processors. The factoradic, also called the factorial number system, is a mixed-radix numeral system suited to numbering permutations. In this new parallel algorithm, the B&B is based on a matrix of integers instead of a pool of permutations, and work units exchanged between threads are intervals of factoradics instead of sets of nodes. Compared to a conventional pool-based approach, the obtained results on flowshop instances demonstrate that our new factoradic-based approach, on average, uses about 60 times less memory to store the pool of subproblems, generates about 1.3 times fewer page faults, waits about 7 times less time to synchronize access to the pool, requires about 9 times less CPU time to manage this pool, and performs about 30,000 times fewer context switches.
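The factoradic numbering that the interval encoding relies on maps each permutation to a unique integer through its Lehmer code. A minimal Python sketch of the two directions of that mapping (illustrative only; the paper exchanges intervals of such ranks between threads):

```python
def perm_to_index(perm):
    """Rank a permutation of 0..n-1 by its Lehmer code: the i-th
    factoradic digit counts later elements smaller than perm[i]."""
    n = len(perm)
    index = 0
    for i in range(n):
        digit = sum(1 for j in range(i + 1, n) if perm[j] < perm[i])
        index = index * (n - i) + digit  # Horner step in factorial base
    return index

def index_to_perm(index, n):
    """Decode a factoradic rank back into a permutation of 0..n-1."""
    digits = []
    for base in range(1, n + 1):  # peel off digits, least significant first
        index, d = divmod(index, base)
        digits.append(d)
    digits.reverse()
    elems = list(range(n))
    return [elems.pop(d) for d in digits]
```

Because the rank is just an integer, a contiguous range of permutations becomes a pair of integers, which is why threads can exchange intervals instead of explicit sets of nodes.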
Citations: 17
Pythia: Faster Big Data in Motion through Predictive Software-Defined Network Optimization at Runtime
Pub Date: 2014-05-19 DOI: 10.1109/IPDPS.2014.20
M. V. Neves, C. Rose, K. Katrinis, H. Franke
The rise of Internet of Things sensors, social networking and mobile devices has led to an explosion of available data. Gaining insights into this data has led to the area of Big Data analytics. The MapReduce framework, as implemented in Hadoop, is one of the most popular frameworks for Big Data analysis. To handle the ever-increasing data size, Hadoop is a scalable framework that allows dedicated, seemingly unbounded numbers of servers to participate in the analytics process. Response time of an analytics request is an important factor in time to value/insights. While the compute and disk I/O requirements can be scaled with the number of servers, scaling the system leads to increased network traffic. Arguably, the communication-heavy phase of MapReduce contributes significantly to the overall response time; the problem is further aggravated if communication patterns are heavily skewed, as is not uncommon in many MapReduce workloads. In this paper we present a system that reduces the skew impact by transparently predicting data communication volume at runtime and mapping the many end-to-end flows among the various processes to the underlying network, using emerging software-defined networking technologies to avoid hotspots in the network. Depending on the network oversubscription ratio, we demonstrate reductions in job completion time between 3% and 46% for popular MapReduce benchmarks like Sort and Nutch.
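A hotspot-avoiding flow mapping of the kind described above can be sketched as a greedy heuristic: place each predicted flow, largest first, on the candidate path whose busiest link is currently least loaded. This is an illustrative Python sketch under stated assumptions, not Pythia's actual placement algorithm: `flows` maps a flow id to its predicted volume, and `paths` lists candidate paths as sequences of link ids.

```python
def map_flows(flows, paths):
    """Greedy hotspot avoidance for predicted flows.

    flows: {flow_id: predicted_volume}
    paths: {flow_id: [path, ...]} where each path is a list of link ids.
    Returns the chosen path per flow and the resulting per-link load.
    """
    load = {}        # link id -> accumulated volume
    placement = {}   # flow id -> chosen path
    for flow, vol in sorted(flows.items(), key=lambda kv: -kv[1]):
        # pick the path whose most-loaded link is lightest right now
        best = min(paths[flow], key=lambda p: max(load.get(l, 0) for l in p))
        placement[flow] = best
        for l in best:
            load[l] = load.get(l, 0) + vol
    return placement, load
```

Placing the largest flows first is the usual bin-packing trick: small flows can fill the gaps the big ones leave, which keeps the maximum link load (the hotspot) low.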
Citations: 38
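The core idea of the paper above — steering predicted-heavy shuffle flows away from shared links — can be caricatured as a load-balancing problem. The sketch below is not Pythia's algorithm (the paper programs real SDN paths from runtime volume predictions); it only illustrates, with invented flow names and volumes, why placing the heaviest flows first spreads a skewed traffic matrix across candidate paths and avoids hotspots.

```python
def assign_flows(flow_volumes, num_paths):
    """Greedily map each flow to the least-loaded of the candidate paths
    an SDN controller could program. Toy sketch, not Pythia itself."""
    load = [0.0] * num_paths
    assignment = {}
    # Place the heaviest (skewed) flows first so they end up on
    # different paths instead of piling onto one congested link.
    for flow, vol in sorted(flow_volumes.items(), key=lambda kv: -kv[1]):
        path = min(range(num_paths), key=load.__getitem__)
        assignment[flow] = path
        load[path] += vol
    return assignment, load

# Hypothetical predicted shuffle volumes (MB) for mapper->reducer flows;
# "m1->r1" is the skewed heavy hitter.
flows = {"m1->r1": 900, "m2->r1": 120, "m3->r2": 100, "m4->r2": 80}
assignment, load = assign_flows(flows, num_paths=2)
```

With two paths, the heavy flow gets a path to itself and the three light flows share the other, so the maximum link load is set by the skewed flow alone rather than by an unlucky pile-up.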
Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.109
Md Rakib Hasan, R. C. Whaley
LU factorization is one of the most widely used methods for solving linear equations, and its performance therefore underlies a broad range of scientific computing. As architectural trends have replaced clock-rate improvements with increases in parallel scale, library writers have responded with tiled algorithms, in which operand size is constrained in order to maximize parallelism, as seen in the well-known PLASMA library. This approach has two main drawbacks: (1) asymptotic performance is reduced due to the limited operand size, and (2) performance on small to medium-sized problems is reduced due to unnecessary data motion in the parallel caches. In this paper we introduce a new approach in which asymptotic performance is maximized by using special low-overhead kernel primitives auto-generated by the ATLAS framework, while unnecessary cache motion is minimized through explicit cache management. We show that this technique can outperform all known libraries at all problem sizes on commodity parallel Intel and AMD platforms, with asymptotic LU performance of roughly 91% of the hardware theoretical peak on a 12-core Intel Xeon and 87% on a 32-core AMD Opteron.
Citations: 6
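The tiled LU algorithms the paper above discusses all share the right-looking blocked structure sketched here: factor a narrow panel, triangular-solve the block row of U, then apply the GEMM trailing update that dominates runtime and is what libraries parallelize. This is a textbook illustration without pivoting (so it is only safe for, e.g., diagonally dominant matrices) and says nothing about the paper's ATLAS kernels or explicit cache management.

```python
import numpy as np

def blocked_lu(A, nb):
    """Right-looking blocked LU without pivoting (illustration only).
    Returns a copy of A overwritten with L (unit lower) and U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1) Panel factorization: unblocked LU of columns k:e, rows k:n.
        for j in range(k, e):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        if e < n:
            # 2) Triangular solve for the U block right of the panel.
            L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L_kk, A[k:e, e:])
            # 3) Trailing-matrix update: the GEMM that dominates the
            #    flop count and that tiled libraries spread across cores.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

rng = np.random.default_rng(0)
n = 12
M = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant
F = blocked_lu(M, nb=4)
L = np.tril(F, -1) + np.eye(n)
U = np.triu(F)
assert np.allclose(L @ U, M)
```

The tile size `nb` is exactly the operand-size knob the abstract refers to: small tiles expose more parallel tasks but shrink the GEMM operands, which is the asymptotic-performance drawback the paper targets.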
Multi-resource Real-Time Reader/Writer Locks for Multiprocessors
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.29
Bryan C. Ward, James H. Anderson
A fine-grained locking protocol permits multiple locks to be held simultaneously by the same task. In the case of real-time multiprocessor systems, prior work on such protocols has considered only mutex constraints. This unacceptably limits concurrency in systems in which some resource accesses are read-only. To remedy this situation, a variant of a recently proposed fine-grained protocol called the real-time nested locking protocol (RNLP) is presented that enables concurrent reads. This variant is shown to have worst-case blocking no worse (and often better) than existing coarse-grained real-time reader/writer locking protocols, while allowing for additional parallelism. Experimental evaluations of the proposed protocol are presented that consider both schedulability (i.e., the ability to validate timing constraints) and implementation-related overheads. These evaluations demonstrate that the RNLP (both the mutex and the proposed reader/writer variant) provides improved schedulability over existing coarse-grained locking protocols, and is practically implementable.
Citations: 21
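The concurrency the paper above recovers — read-only critical sections need not exclude one another — can be demonstrated with a deliberately simple lock. The sketch below is not the RNLP (it supports no nested multi-resource acquisition, no priority ordering, and gives no real-time blocking bounds; writers can even starve); it only shows the basic reader/writer semantics the protocol generalizes.

```python
import threading

class SimpleRWLock:
    """Minimal reader/writer lock: many concurrent readers, one writer.
    Illustration only -- not the RNLP."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:          # readers only wait for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:  # writers exclude everyone
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()

# Three readers hold the lock simultaneously: the barrier only releases
# once all of them are inside their read-side critical sections, which
# a plain mutex would make impossible.
lock = SimpleRWLock()
barrier = threading.Barrier(3)
done = []

def reader():
    lock.acquire_read()
    barrier.wait(timeout=5)
    done.append(True)
    lock.release_read()

threads = [threading.Thread(target=reader) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(done) == 3  # all three readers proceeded concurrently
```

Under a mutex-only protocol the barrier above would deadlock, which is precisely the lost concurrency the abstract calls "unacceptable" for read-mostly real-time workloads.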
Journal
2014 IEEE 28th International Parallel and Distributed Processing Symposium