
Latest publications from the 2014 21st International Conference on High Performance Computing (HiPC)

Design and evaluation of parallel hashing over large-scale data
Pub Date : 2014-12-20 DOI: 10.1109/HiPC.2014.7116909
Long Cheng, S. Kotoulas, Tomas E. Ward, G. Theodoropoulos
High-performance analytical data processing systems often run on servers with large amounts of memory. A common data structure used in such environments is the hash table. This paper focuses on investigating efficient parallel hash algorithms for processing large-scale data. Currently, hash tables on distributed architectures are accessed one key at a time by local or remote threads, while shared-memory approaches focus on accessing a single table with multiple threads. A relatively straightforward “bulk-operation” approach seems to have been neglected by researchers. In this work, using such a method, we propose a high-level parallel hashing framework, Structured Parallel Hashing, targeting efficient processing of massive data on distributed memory. We present a theoretical analysis of the proposed method and describe the design of our hashing implementations. The evaluation reveals a very interesting result: the proposed straightforward method can vastly outperform distributed hashing methods and can even offer performance comparable with approaches based on shared-memory supercomputers which use specialized hardware predicates. Moreover, we characterize the performance of our hash implementations through extensive experiments, thereby allowing system developers to make a more informed choice for their high-performance applications.
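As an editorial illustration of the “bulk-operation” idea (a minimal sketch under assumed names and partitioning, not the paper's Structured Parallel Hashing framework), keys can be scattered into hash-determined partitions first, so that each thread then builds its own table without per-key synchronization:

```cpp
// A minimal sketch of bulk hashing: hash keys into partitions first, then
// let each thread build its partition's table with no locks or remote
// accesses during construction.
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

std::vector<std::unordered_map<std::uint64_t, std::uint64_t>>
bulk_build(const std::vector<std::uint64_t>& keys, int nparts) {
    // Phase 1: scatter keys into per-partition buckets by hash value.
    std::vector<std::vector<std::uint64_t>> buckets(nparts);
    for (std::uint64_t k : keys)
        buckets[std::hash<std::uint64_t>{}(k) % nparts].push_back(k);

    // Phase 2: each thread owns whole partitions, so tables are built
    // in bulk without any synchronization on individual keys.
    std::vector<std::unordered_map<std::uint64_t, std::uint64_t>> tables(nparts);
    #pragma omp parallel for schedule(dynamic)
    for (int p = 0; p < nparts; ++p) {
        tables[p].reserve(buckets[p].size());
        for (std::uint64_t k : buckets[p])
            tables[p].emplace(k, k);   // toy payload: value == key
    }
    return tables;
}
```

On distributed memory the same two-phase structure applies, with the scatter phase becoming a batched all-to-all exchange of keys rather than a local copy.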
Citations: 8
Saving energy by exploiting residual imbalances on iterative applications
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116895
E. Padoin, M. Castro, L. Pilla, P. Navaux, J. Méhaut
The power consumption of High Performance Computing (HPC) systems is an increasing concern as large-scale systems grow in size and, consequently, consume more energy. In response to this challenge, we propose two variants of a new energy-aware load balancer that aim at reducing the energy consumption of parallel platforms running imbalanced scientific applications without degrading their performance. Our research combines dynamic load balancing with DVFS techniques in order to reduce the clock frequency of underloaded computing cores which experience some residual imbalance even after tasks are remapped. Experimental results with benchmarks and a real-world application showed energy savings of up to 32% with our fine-grained variant that performs per-core DVFS, and of up to 34% with our coarse-grained variant that performs per-chip DVFS.
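To make the DVFS side concrete: on Linux, a runtime can lower an underloaded core's clock from user space through the cpufreq sysfs files. The sketch below is a toy under loudly stated assumptions (the "userspace" governor enabled, root privileges, made-up loads and frequency steps) and is not the paper's load balancer:

```cpp
// Toy per-core DVFS via the standard Linux cpufreq sysfs interface.
// Assumes the "userspace" governor and root privileges; loads and
// frequency steps below are invented for illustration.
#include <fstream>
#include <iostream>
#include <string>

// Request a target frequency (in kHz) for one core.
bool set_core_freq_khz(int cpu, long khz) {
    std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu)
                    + "/cpufreq/scaling_setspeed");
    if (!f) return false;
    f << khz << std::flush;
    return static_cast<bool>(f);
}

int main() {
    // Residual per-core loads after task remapping (made-up numbers).
    const double load[4] = {1.00, 0.95, 0.60, 0.55};
    double avg = 0.0;
    for (double l : load) avg += l;
    avg /= 4.0;

    for (int cpu = 0; cpu < 4; ++cpu) {
        // Underloaded cores get a lower step; hypothetical 1.2/2.4 GHz.
        const long khz = (load[cpu] < 0.8 * avg) ? 1200000 : 2400000;
        if (!set_core_freq_khz(cpu, khz))
            std::cerr << "cpu" << cpu
                      << ": cannot set frequency (governor/permissions?)\n";
    }
    return 0;
}
```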
Citations: 14
Optimization of scan algorithms on multi- and many-core processors
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116883
Qiao Sun, Chao Yang
Scan is a basic building block widely utilized in many applications. With the emergence of multi-core and many-core processors, the study of highly scalable parallel scan algorithms becomes increasingly important. In this paper, we first propose a novel parallel scan algorithm based on the fine-grained dynamic task scheduling in QUARK, and then derive a cache-friendly framework for any parallel scan kernel. The QUARK-scan is superior to the fastest available counterpart, proposed by Zhang in 2012, and to many other parallel scans in several aspects, including greatly improved load balance and a substantially reduced number of global barriers. On the other hand, the cache-friendly framework helps in improving cache line usage and is flexible enough to apply to any parallel scan kernel. A variety of optimization techniques such as SIMD vectorization, loop unrolling, adjacent synchronization and thread affinity are exploited in QUARK-scan and in the cache-friendly versions of both QUARK-scan and Zhang's scan. Experiments done on three typical multi- and many-core platforms indicate that the proposed QUARK-scan and the cache-friendly Zhang's scan are superior in different scenarios.
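As background for the comparison, the baseline structure that parallel scans share is a blocked, two-pass algorithm: scan each block locally, scan the block totals, then shift each block. A generic OpenMP sketch of that baseline (not the QUARK-scheduled algorithm or its cache-friendly variant):

```cpp
// A generic two-pass blocked inclusive scan with OpenMP; a baseline
// sketch only, not the paper's QUARK-scheduled algorithm.
#include <cstddef>
#include <vector>
#include <omp.h>

void parallel_inclusive_scan(std::vector<double>& a) {
    const int T = omp_get_max_threads();
    const std::size_t n = a.size();
    std::vector<double> sums(T + 1, 0.0);   // sums[t+1] = total of block t

    #pragma omp parallel num_threads(T)
    {
        const int t = omp_get_thread_num();
        const std::size_t lo = n * t / T, hi = n * (t + 1) / T;

        // Pass 1: scan each block locally and record its total.
        double s = 0.0;
        for (std::size_t i = lo; i < hi; ++i) { s += a[i]; a[i] = s; }
        sums[t + 1] = s;
        #pragma omp barrier

        // One thread turns block totals into per-block offsets.
        #pragma omp single
        for (int b = 1; b <= T; ++b) sums[b] += sums[b - 1];
        // (implicit barrier at the end of the single construct)

        // Pass 2: shift every block by the sum of all preceding blocks.
        for (std::size_t i = lo; i < hi; ++i) a[i] += sums[t];
    }
}
```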
Citations: 2
Interface for heterogeneous kernels: A framework to enable hybrid OS designs targeting high performance computing on manycore architectures
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116885
Taku Shimosawa, Balazs Gerofi, Masamichi Takagi, Gou Nakamura, Tomoki Shirasawa, Yuji Saeki, M. Shimizu, A. Hori, Y. Ishikawa
Turning towards exascale systems and beyond, it has been widely argued that currently available systems software will not be feasible, owing to requirements such as the ability to deal with heterogeneous architectures, the need for systems-level optimization targeting specific applications, the elimination of OS noise, and, at the same time, compatibility with legacy applications. To cope with these issues, a hybrid design of operating systems, in which light-weight specialized kernels cooperate with a traditional OS kernel, seems adequate, and a number of recent research projects are now heading in this direction. This paper presents Interface for Heterogeneous Kernels (IHK), a general framework enabling hybrid kernel designs in systems equipped with manycore processors and/or accelerators. IHK provides a range of capabilities, such as resource partitioning and management of heterogeneous OS kernels, as well as a low-level communication layer among the kernels. We describe IHK's interface and demonstrate its feasibility for hybrid kernel designs by executing various lightweight OS kernels on top of it, each specialized for certain types of applications. We use the Intel Xeon Phi, Intel's latest manycore coprocessor, as our experimental platform.
Citations: 36
A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on Infiniband clusters
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116875
Akshay Venkatesh, H. Subramoni, Khaled Hamidouche, D. Panda
Several streaming applications in the field of high performance computing are obtaining significant speedups in execution time by leveraging the raw compute power offered by modern GPGPUs. This raw compute power, coupled with the high network throughput offered by high performance interconnects such as InfiniBand (IB), is allowing streaming applications to scale rapidly. A frequently used operation that is central to the execution of multi-node streaming applications is the broadcast operation, where data from a single source is transmitted to multiple sinks, typically from a live data site. Although high performance networks like IB offer novel features like hardware-based multicast to speed up the broadcast operation, their benefits have been limited to host-based applications due to the inability of IB Host Channel Adapters (HCAs) to directly access the memory of the GPGPUs. This poses a significant performance bottleneck for high performance streaming applications that rely heavily on broadcast operations from GPU memories. The recently introduced GPUDirect RDMA feature alleviates this bottleneck by enabling IB HCAs to perform data transfers directly to/from GPU memory (bypassing host memory). Thus, it presents an attractive alternative for designing high performance broadcast operations for GPGPU-based high performance streaming applications. In this work, we propose a novel method for fully utilizing GPUDirect RDMA and hardware multicast features in tandem to design a high performance broadcast operation for streaming applications. Experiments conducted with the proposed design show up to a 60% decrease in latency and a 3X-4X improvement in a throughput benchmark compared to the naive scheme on 64 GPU nodes.
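From the application side, the appeal of such designs is that collectives can be issued directly on device buffers. A minimal usage sketch, assuming a CUDA-aware MPI build; this shows the calling pattern only, not the paper's multicast-based design:

```cpp
// Broadcasting a device buffer with a CUDA-aware MPI build. With
// GPUDirect RDMA underneath, the HCA reads/writes GPU memory directly;
// without it, the MPI library stages the data through host memory.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                        // one "frame" of stream data
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));
    if (rank == 0)
        cudaMemset(d_buf, 0, n * sizeof(float));  // root produces the frame

    // The device pointer is handed to MPI directly; no explicit staging.
    MPI_Bcast(d_buf, n, MPI_FLOAT, /*root=*/0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```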
Citations: 13
A fast implementation of MLR-MCL algorithm on multi-core processors
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116888
Q. Niu, Pai-Wei Lai, S. M. Faisal, S. Parthasarathy, P. Sadayappan
Widespread use of stochastic flow-based graph clustering algorithms, e.g. Markov Clustering (MCL), has been hampered by their lack of scalability and by fragmentation of their output. Multi-Level Regularized Markov Clustering (MLR-MCL) is an improvement over MCL, providing faster performance and better-quality clusters for large graphs. However, a closer look at MLR-MCL's performance reveals potential for further improvement. In this paper we present a fast parallel implementation of the MLR-MCL algorithm via static work partitioning based on an analysis of memory footprints. By parallelizing the most time-consuming region of the sequential MLR-MCL algorithm, we report up to 10.43x (5.22x on average) speedup on CPU, using 8 datasets from SNAP and 3 PPI datasets. In addition, our algorithm can be adapted to perform general sparse matrix-matrix multiplication (SpGEMM), and our experimental evaluation shows up to 3.50x (1.92x on average) speedup on CPU, and up to 5.12x (2.20x on average) speedup on MIC, compared to the SpGEMM kernel provided by the Intel Math Kernel Library (MKL).
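The SpGEMM kernel underlying MCL-style clustering is typically organized row by row (Gustavson's algorithm), which also exposes the row-wise parallelism exploited here. A generic OpenMP sketch with per-thread dense accumulators follows; the CSR layout and scheduling parameters are illustrative assumptions, not the paper's static partitioning:

```cpp
// Row-parallel CSR SpGEMM (Gustavson's algorithm) with per-thread
// dense accumulators; a generic sketch of the kernel family.
#include <vector>

struct CSR {
    int rows = 0, cols = 0;
    std::vector<int> ptr, idx;   // row pointers (rows+1), column indices
    std::vector<double> val;     // nonzero values
};

CSR spgemm(const CSR& A, const CSR& B) {
    CSR C;
    C.rows = A.rows; C.cols = B.cols;
    std::vector<std::vector<int>> cidx(A.rows);
    std::vector<std::vector<double>> cval(A.rows);

    #pragma omp parallel
    {
        std::vector<double> acc(B.cols, 0.0);  // per-thread dense scratch row
        std::vector<char> used(B.cols, 0);     // which columns were touched
        std::vector<int> touched;
        #pragma omp for schedule(dynamic, 64)
        for (int i = 0; i < A.rows; ++i) {
            touched.clear();
            for (int k = A.ptr[i]; k < A.ptr[i + 1]; ++k) {
                const int acol = A.idx[k];
                const double av = A.val[k];
                for (int j = B.ptr[acol]; j < B.ptr[acol + 1]; ++j) {
                    const int c = B.idx[j];
                    if (!used[c]) { used[c] = 1; touched.push_back(c); }
                    acc[c] += av * B.val[j];
                }
            }
            for (int c : touched) {            // gather row i, reset scratch
                cidx[i].push_back(c);
                cval[i].push_back(acc[c]);
                acc[c] = 0.0; used[c] = 0;
            }
        }
    }
    C.ptr.assign(C.rows + 1, 0);               // assemble final CSR arrays
    for (int i = 0; i < C.rows; ++i)
        C.ptr[i + 1] = C.ptr[i] + static_cast<int>(cidx[i].size());
    for (int i = 0; i < C.rows; ++i) {
        C.idx.insert(C.idx.end(), cidx[i].begin(), cidx[i].end());
        C.val.insert(C.val.end(), cval[i].begin(), cval[i].end());
    }
    return C;
}
```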
Citations: 12
Mixed-precision models for calculation of high-order virial coefficients on GPUs
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116898
Chao Feng, A. Schultz, V. Chaudhary, D. Kofke
The virial equation of state (VEOS) is a density expansion of the thermodynamic pressure with respect to an ideal-gas reference. Its coefficients can be computed from a molecular model, and become more expensive to calculate at higher order. In this paper, we use GPUs to calculate the 8th, 9th and 10th virial coefficients of the Lennard-Jones (LJ) potential model by the Mayer Sampling Monte Carlo (MSMC) method and Wheatley's algorithm. Two mixed-precision models are proposed to overcome a potential precision limitation of current GPUs while maintaining the performance benefit. On the latest Kepler-architecture GPU, the Tesla K40, an average speedup of 20 to 40 is achieved for these calculations.
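For reference, the VEOS expresses the compressibility factor as a power series in density, and the coefficients B_n(T) (here up to n = 10) are the quantities computed:

```latex
Z \;=\; \frac{p}{\rho k_B T}
  \;=\; 1 \;+\; B_2(T)\,\rho \;+\; B_3(T)\,\rho^{2} \;+\; \cdots \;+\; B_n(T)\,\rho^{\,n-1} \;+\; \cdots
```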
Citations: 0
Matrix-matrix multiplication on a large register file architecture with indirection
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116709
D. Sreedhar, J. Derby, R. Montoye, C. Johnson
Dense matrix-matrix multiply is an important kernel in many high performance computing applications, including the emerging deep neural network based cognitive computing applications. Graphics processing units (GPUs) have been very successful in handling dense matrix-matrix multiply in a variety of applications. However, recent research has shown that GPUs are very inefficient in using the available compute resources on the silicon for matrix multiply, in terms of utilization of peak floating point operations per second (FLOPS). In this paper, we show that an architecture with a large register file supported by “indirection” can utilize the floating point computing resources on the processor much more efficiently. A key feature of our proposed in-line accelerator is a bank-based very-large register file with embedded SIMD support. This processor-in-regfile (PIR) strategy is implemented as local computation elements (LCEs) attached to each bank, overcoming the limited number of register file ports. Because each LCE is a SIMD computation element, and all of them can proceed concurrently, the PIR approach constitutes a highly-parallel super-wide-SIMD device. We show that we can achieve more than 25% better performance than the best known results for matrix multiply using GPUs. This is achieved using far fewer floating point computing units and hence less silicon area and power. We also show that the architecture blends well with the Strassen and Winograd matrix multiply algorithms. We optimize the selective data parallelism that the LCEs enable for these algorithms and study the area-performance trade-offs.
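Register capacity matters because high-performance matrix multiply is built around micro-kernels that keep an output tile resident in registers across the entire inner loop. A generic CPU-style 4x4 micro-kernel illustrating the principle (an editorial sketch, not the paper's PIR/LCE hardware):

```cpp
// C (row-major, ldc) += A (row-major, lda) * B (row-major, ldb) for one
// 4x4 tile of C; K is the shared dimension. The tile c[][] stays in
// registers for the whole k-loop, which is why register capacity bounds
// the achievable tile size and arithmetic intensity.
#include <cstddef>

void microkernel_4x4(const double* A, const double* B, double* C,
                     std::size_t lda, std::size_t ldb, std::size_t ldc,
                     std::size_t K) {
    double c[4][4] = {};
    for (std::size_t k = 0; k < K; ++k) {
        const double a0 = A[0 * lda + k], a1 = A[1 * lda + k],
                     a2 = A[2 * lda + k], a3 = A[3 * lda + k];
        for (int j = 0; j < 4; ++j) {
            const double b = B[k * ldb + j];
            c[0][j] += a0 * b; c[1][j] += a1 * b;
            c[2][j] += a2 * b; c[3][j] += a3 * b;
        }
    }
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            C[i * ldc + j] += c[i][j];
}
```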
Citations: 0
Xevolver: An XML-based code translation framework for supporting HPC application migration
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116902
H. Takizawa, S. Hirasawa, Yasuharu Hayashi, Ryusuke Egawa, Hiroaki Kobayashi
This paper proposes an extensible programming framework to separate platform-specific optimizations from application codes. The framework allows programmers to define their own code translation rules for the special demands of individual systems, compilers, libraries, and applications. Code translation rules associated with user-defined compiler directives are defined in an external file, and the application code is simply annotated with the directives. For rule-based code transformations, the framework exposes the abstract syntax tree (AST) of an application code as an XML document to expert programmers. Hence, the XML document of an AST can be transformed using any XML-based technologies. Our case studies using real applications demonstrate that the framework is effective at separating platform-specific optimizations from application codes, and at incrementally improving the performance of an existing application without messing up the code.
Citations: 35
Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms
Pub Date : 2014-12-01 DOI: 10.1109/HiPC.2014.7116910
Y. Wen, Zheng Wang, M. O’Boyle
Heterogeneous systems consisting of multiple CPUs and GPUs are increasingly attractive as platforms for high performance computing. Such platforms are usually programmed using OpenCL, which provides program portability by allowing the same program to execute on different types of device. As such systems become more mainstream, they will move from application-dedicated devices to platforms that need to support multiple concurrent user applications. Here there is a need to determine when and where to map different applications so as to best utilize the available heterogeneous hardware resources. In this paper, we present an efficient OpenCL task scheduling scheme which schedules multiple kernels from multiple programs on CPU/GPU heterogeneous platforms. It does this by determining at runtime which kernels are likely to best utilize a device. We show that speedup is a good scheduling priority function and develop a novel model that predicts a kernel's speedup based on its static code structure. Our scheduler uses this prediction and the runtime input data size to prioritize and schedule tasks. This technique is applied to a large set of concurrent OpenCL kernels. We evaluated our approach for system throughput and average turnaround time against competitive techniques on two different platforms: a Core i7/Nvidia GTX590 platform and a Core i7/AMD Tahiti 7970 platform. For system throughput, we achieve, on average, a 1.21x and a 1.25x improvement over the best competitors on the NVIDIA and AMD platforms respectively. Our approach reduces the turnaround time, on average, by at least 1.5x and 1.2x on the NVIDIA and AMD platforms respectively, when compared to alternative approaches.
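The shape of such a policy is easy to sketch: rank pending kernels by a priority combining predicted speedup with input size, then dispatch each to the device it is predicted to use best. In the toy below, the kernels, speedup numbers, and dispatch threshold are all invented; the paper's actual model is derived from static code structure:

```cpp
// Host-side shape of speedup-as-priority scheduling; everything concrete
// here (task names, speedups, sizes, threshold) is made up for illustration.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct KernelTask {
    std::string name;
    double predicted_speedup;   // predicted GPU-over-CPU speedup
    std::size_t input_bytes;    // runtime input size
};

int main() {
    std::vector<KernelTask> pending = {
        {"fft",       9.5, 1u << 26},
        {"histogram", 1.2, 1u << 20},
        {"nbody",    14.0, 1u << 24},
    };

    // Priority: weight predicted speedup by input size, so large,
    // GPU-friendly kernels are scheduled ahead of the rest.
    auto prio = [](const KernelTask& k) {
        return k.predicted_speedup * static_cast<double>(k.input_bytes);
    };
    std::sort(pending.begin(), pending.end(),
              [&](const KernelTask& a, const KernelTask& b) {
                  return prio(a) > prio(b);
              });

    for (const KernelTask& k : pending) {
        const char* device = (k.predicted_speedup > 2.0) ? "GPU" : "CPU";
        std::cout << k.name << " -> " << device << '\n';
    }
    return 0;
}
```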
Citations: 134