On the suitability of MPI as a PGAS runtime
J. Daily, Abhinav Vishnu, B. Palmer, H. V. Dam, D. Kerbyson
2014 21st International Conference on High Performance Computing (HiPC)
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116712
Partitioned Global Address Space (PGAS) models are emerging as a popular alternative to MPI models for designing scalable applications. At the same time, MPI remains a ubiquitous communication subsystem due to its standardization, high performance, and availability on leading platforms. In this paper, we explore the suitability of using MPI as a scalable PGAS communication subsystem. We focus on Remote Memory Access (RMA) communication in PGAS models, which typically includes get, put, and atomic memory operations. We perform an in-depth exploration of design alternatives based on MPI, including a semantically matching interface such as MPI-RMA, as well as less intuitive interfaces such as MPI two-sided combined with multi-threading and dynamic process management. Guided by this exploration and the shortcomings it reveals, we propose a novel design facilitated by the data-centric view in PGAS models. This design leverages a combination of highly tuned MPI two-sided semantics and an automatic, user-transparent split of MPI communicators to provide asynchronous progress. We implement this asynchronous progress ranks (PR) approach, along with the other alternatives, within the Communication Runtime for Exascale, the communication subsystem for Global Arrays. Our performance evaluation spans pure communication benchmarks, graph community detection and sparse matrix-vector multiplication kernels, and a computational chemistry application. The utility of the proposed PR-based approach is demonstrated by a 2.17x speedup on 1008 processors over the other MPI-based designs.
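The core of the PR design is a user-transparent partition of the MPI ranks: on each node, a few ranks are reserved to service one-sided get/put/accumulate requests using two-sided MPI calls, so the remaining compute ranks see asynchronous progress. The sketch below illustrates only the rank-partitioning step, in plain Python rather than MPI; the node-packed rank layout and the function name are assumptions, not the paper's implementation.

```python
def split_ranks(world_size, ranks_per_node, progress_per_node=1):
    """Partition world ranks into compute ranks and progress ranks (PRs).

    Assumes ranks are packed node by node; the last `progress_per_node`
    ranks on each node are set aside as PRs. In a real MPI program the
    compute list would seed an MPI_Comm_split color, transparently to
    the application.
    """
    compute, progress = [], []
    for rank in range(world_size):
        if rank % ranks_per_node >= ranks_per_node - progress_per_node:
            progress.append(rank)  # services RMA requests for its node
        else:
            compute.append(rank)   # runs the application
    return compute, progress
```

For example, with 8 ranks packed 4 per node, one rank per node (3 and 7) becomes a progress rank and the rest remain compute ranks.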
Combining HoL-blocking avoidance and differentiated services in high-speed interconnects
P. Yébenes, J. Escudero-Sahuquillo, Crispín Gómez Requena, P. García, F. J. Alfaro, F. Quiles, J. Duato
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116874
Current high-performance platforms such as datacenters or High-Performance Computing systems rely on high-speed interconnection networks able to cope with the ever-increasing communication requirements of modern applications. In particular, in high-performance systems that must offer differentiated services to applications involving traffic prioritization, it is almost mandatory that the interconnection network provide some type of Quality-of-Service (QoS) and congestion-management mechanism in order to achieve the required network performance. Most current QoS and congestion-management mechanisms for high-speed interconnects use the same kinds of resources, but with different criteria, resulting in disjoint mechanisms. By contrast, we propose in this paper a novel, straightforward solution that leverages the resources already available in InfiniBand components (basically Service Levels and Virtual Lanes) to provide both QoS and congestion management at the same time. This proposal is called CHADS (Combined HoL-blocking Avoidance and Differentiated Services), and it can be applied to any network topology. From the results shown in this paper for networks configured with the novel, cost-efficient KNS hybrid topology, we conclude that CHADS is more efficient than other schemes at reducing interference among packet flows that have the same or different priorities.
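The abstract's key idea, sharing one pool of Virtual Lanes between QoS and congestion management, can be pictured as carving the VLs into one group per priority and then spreading flows within a group so they do not queue behind each other. The toy mapping below is only an illustration of that resource split; real InfiniBand SL-to-VL tables are per-port and configured by the subnet manager, and the hashing scheme here is an assumption.

```python
def select_vl(service_level, destination, vls_per_priority):
    """Illustrative combined QoS + HoL-blocking-avoidance VL selection.

    The VL space is partitioned into contiguous groups, one group per
    priority (service level), giving differentiated service; inside a
    group, the destination hashes to a VL so flows toward different
    hotspots occupy different queues, limiting head-of-line blocking.
    """
    group_base = service_level * vls_per_priority  # QoS partition
    return group_base + destination % vls_per_priority  # HoL spreading
```

With 4 VLs per priority, two flows of the same priority toward different destinations land on different VLs, while a higher-priority flow always uses a disjoint VL group.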
An improved recursive graph bipartitioning algorithm for well balanced domain decomposition
Astrid Casadei, P. Ramet, J. Roman
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116878
In the context of hybrid sparse linear solvers based on domain decomposition and Schur complement approaches, obtaining a decomposition that balances both the internal node set size and the interface node set size across all domains is critical for load balancing and efficiency in a parallel computation. For this purpose, we revisit the original algorithm introduced by Lipton, Rose and Tarjan [1] in 1979, which performs the recursion for nested dissection in a particular manner. Starting from this specific recursive strategy, we propose several variations of the existing algorithms in the multilevel Scotch partitioner that take these multiple criteria into account, and we illustrate the improved results on a collection of graphs corresponding to finite element meshes used in numerical scientific applications.
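To make the nested-dissection recursion concrete, the sketch below runs it on the simplest possible graph, a path, where the median vertex is an exact separator: each split yields two equal-sized domains and a one-vertex interface, i.e. both balance criteria at once. This is purely illustrative; real partitioners like Scotch compute separators on general graphs with multilevel heuristics.

```python
def nested_dissection(lo, hi, min_size=2):
    """Toy nested dissection on a path graph with vertices lo..hi-1.

    Splits on the median vertex (the separator) and recurses on the
    two halves, mirroring the Lipton-Rose-Tarjan-style recursion the
    paper revisits. Leaves of the tree are the final domains; internal
    nodes hold the interface (separator) vertices.
    """
    n = hi - lo
    if n <= min_size:
        return {"domain": list(range(lo, hi))}
    mid = lo + n // 2
    return {
        "separator": [mid],
        "left": nested_dissection(lo, mid, min_size),
        "right": nested_dissection(mid + 1, hi, min_size),
    }
```

On a 7-vertex path this yields separator {3} at the top and separators {1} and {5} below it, with every leaf domain of size at most 2: both the internal and interface node sets stay balanced at each level.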
GPU parallelization of the stochastic on-time arrival problem
Maleen Abeydeera, S. Samaranayake
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116896
The Stochastic On-Time Arrival (SOTA) problem has recently been studied as an alternative to traditional shortest-path formulations in situations with hard deadlines. The goal is to find a routing strategy that maximizes the probability of reaching the destination within a pre-specified time budget, where the edge weights of the graph are random variables with arbitrary distributions. While this is a practically useful formulation for vehicle routing, commercial deployment of such methods is not currently feasible due to the high computational complexity of existing solutions. We present a parallelization strategy, implemented in CUDA, that improves computation times by multiple orders of magnitude over single-threaded CPU implementations. One order of magnitude is achieved via naive parallelization of the problem, and another via optimal utilization of the GPU resources. We also show that the runtime can be further reduced in certain cases using dynamic thread assignment and an edge-clustering method that accelerates queries with a small time budget.
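The sequential computation being parallelized is a dynamic program over (node, remaining-budget) pairs: the success probability from a node is the best, over outgoing edges, of the travel-time distribution convolved with the successor's success probabilities. The discrete-time sketch below shows that DP; the graph encoding (adjacency lists of per-edge probability mass functions) is an assumption for illustration, not the paper's data layout.

```python
def sota(graph, dest, budget):
    """Discrete-time Stochastic On-Time Arrival dynamic program.

    graph[v] is a list of (neighbor, pmf) pairs, where pmf[tau] is the
    probability that the edge takes tau time steps (pmf[0] unused).
    Returns u[v][t]: the probability of reaching `dest` from v within
    t steps under the optimal (budget-adaptive) routing policy.
    """
    u = {v: [0.0] * (budget + 1) for v in graph}
    for t in range(budget + 1):
        u[dest][t] = 1.0  # already at the destination
    for t in range(1, budget + 1):
        for v in graph:
            if v == dest:
                continue
            best = 0.0
            for w, pmf in graph[v]:
                # Convolve the edge's travel-time pmf with the
                # successor's success probabilities.
                p = sum(pmf[tau] * u[w][t - tau]
                        for tau in range(1, min(t, len(pmf) - 1) + 1))
                best = max(best, p)
            u[v][t] = best  # take the best outgoing edge
    return u
```

The inner convolution for each (node, budget) pair is independent of the others at the same time step, which is what makes the problem amenable to the massive per-thread parallelism described in the abstract.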
DRIVE: Using implicit caching hints to achieve disk I/O reduction in virtualized environments
Sujesha Sudevalayam, Purushottam Kulkarni
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116877
Co-hosting of virtualized applications results in similar content across multiple blocks on disk, which are fetched into memory (the host's page cache). Content similarity can be harnessed both to avoid duplicate disk I/O requests that fetch the same content repeatedly and to prevent multiple occurrences of duplicate content in the cache. Typically, caches store the most recently or frequently accessed blocks to reduce the number of disk read accesses. These caches are referenced by block number and cannot recognize content similarity across multiple blocks. Existing work in memory deduplication merges cache pages only after multiple identical blocks have already been fetched from disk into the cache, while existing work in I/O deduplication reserves a portion of the host cache as a content-aware cache. We propose a disk I/O reduction system for virtualized environments that addresses the dual problems of duplicate I/O and duplicate content in the host cache, without being invasive. We build a disk read-access optimization called DRIVE that identifies content similarity across multiple blocks and performs hint-based read I/O redirection to improve cache effectiveness, further reducing the number of disk reads. A metadata store is maintained based on the virtual machine's disk accesses, and implicit caching hints are collected for future read I/O redirection. The read I/O redirection is performed from within the virtual block device of the virtualized system, implicitly managing the entire host cache as a content-deduplicated cache. Our trace-based evaluation using a custom simulator reveals that DRIVE always performs at least as well as the vanilla system, achieving up to 20% better cache-hit ratios and reducing the number of disk reads by up to 80%. The results also indicate that our system achieves up to 97% content deduplication in the host cache.
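The mechanism described above can be sketched as two small structures: a metadata store mapping block numbers to content hashes (the implicit hints) and an LRU cache keyed by content hash rather than block number, so a read is redirected to an already-cached copy of identical content. This is a minimal stand-in for DRIVE, not its implementation; it ignores writes, stale hints, and the virtual-block-device plumbing.

```python
import hashlib
from collections import OrderedDict

class ContentDedupCache:
    """Sketch of hint-based read redirection over a content-keyed cache."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                 # block number -> bytes
        self.block_to_hash = {}          # metadata store: implicit hints
        self.cache = OrderedDict()       # content hash -> bytes (LRU)
        self.disk_reads = 0

    def read(self, block):
        h = self.block_to_hash.get(block)
        if h is not None and h in self.cache:
            self.cache.move_to_end(h)    # redirected hit: no disk I/O
            return self.cache[h]
        data = self.disk[block]          # miss: fetch from disk
        self.disk_reads += 1
        h = hashlib.sha1(data).hexdigest()
        self.block_to_hash[block] = h    # record hint for future reads
        self.cache[h] = data             # identical content stored once
        self.cache.move_to_end(h)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict LRU entry
        return data
```

Note that the first read of each distinct block number still goes to disk (no hint exists yet), but identical content occupies a single cache slot, and every later read of any block with a matching hint is served from that one copy.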
Software based ultrasound B-mode/beamforming optimization on GPU and its performance prediction
T. Phuong, Jeong-Gun Lee
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116911
In this paper, we design and optimize an ultrasound B-mode imaging pipeline, including its computationally demanding beamformer, on a commercial GPU. For performance optimization, we explore the design space spanned by the use of different memory types, instruction scheduling, thread mapping strategies, and so on. Then, with the developed B-mode imaging code, we conduct performance evaluations on various GPUs with different architectural features (e.g., the number of cores and core frequency). Through experiments on these devices, we identify "performance-significant factors": the hardware features affecting B-mode imaging performance. We then derive the analytical relationship between these GPU architectural design factors and B-mode imaging performance for our target application. From a commercial product-development perspective, the prediction model lets us select the GPU architectures best suited for ultrasound applications. In the future, such predictions could also be used to customize a "cost-minimal" GPU architecture that satisfies a given performance constraint. In addition, the prediction model can be used to dynamically control the activity of GPU components according to the temporal requirements on performance and power/energy consumption in portable ultrasound diagnosis systems.
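The beamforming kernel at the heart of a B-mode pipeline is typically delay-and-sum: for each image pixel, pick from every transducer channel the sample whose arrival time matches the round trip to that pixel, and sum. The sketch below shows that per-pixel kernel in plain Python; the plane-wave transmit geometry, sound speed, and sampling rate are illustrative assumptions, not the paper's configuration.

```python
import math

def delay_and_sum(rf, element_x, focus_x, focus_z, c=1540.0, fs=40e6):
    """Delay-and-sum receive beamforming for one focal point.

    rf[i] is the sampled echo signal of transducer element i, and
    element_x[i] its lateral position in metres. For the focal point
    (focus_x, focus_z), each channel contributes the sample at its
    round-trip-corrected arrival time. On a GPU, one thread would
    compute one such pixel.
    """
    out = 0.0
    for i, channel in enumerate(rf):
        # Receive path: distance from the focal point back to element i.
        d_rx = math.hypot(focus_x - element_x[i], focus_z)
        # Transmit path (plane wave straight down): depth only.
        delay = (focus_z + d_rx) / c
        idx = int(round(delay * fs))
        if 0 <= idx < len(channel):
            out += channel[idx]
    return out
```

Because every pixel reads many channels at scattered indices, the choice of GPU memory type for `rf` (texture, shared, or global) and the thread-to-pixel mapping dominate performance, which is exactly the design space the paper explores.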
A multilevel compressed sparse row format for efficient sparse computations on multicore processors
H. Kabir, J. Booth, P. Raghavan
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116882
We seek to improve the performance of sparse matrix computations on multicore processors with non-uniform memory access (NUMA). Typical implementations use a bandwidth-reducing ordering of the matrix to increase locality of accesses, with a compressed storage format to store and operate only on the non-zero values. We propose a new multilevel storage format and a companion ordering scheme as an explicit adaptation to NUMA hierarchies. More specifically, we propose CSR-k, a multilevel form of the popular compressed sparse row (CSR) format for a multicore processor with k > 1 well-differentiated levels in the memory subsystem. Additionally, we develop Band-k, a modified form of a traditional bandwidth-reduction scheme, to convert a matrix represented in CSR to our proposed CSR-k. We evaluate the performance of the widely used and important sparse matrix-vector multiplication (SpMV) kernel using CSR-2 on Intel Westmere processors for a test suite of 12 large sparse matrices with row densities in the range 3 to 45. On 32 cores, averaged across all matrices in the test suite, the execution time for SpMV with CSR-2 is less than 42% of the time taken by the state-of-the-art automatically tuned SpMV, resulting in energy savings of approximately 56%. Additionally, on average, the parallel speed-up on 32 cores relative to 1-core performance is 8.18 for the automatically tuned SpMV, compared to 19.71 for CSR-2. Our analysis indicates that the higher performance of SpMV with CSR-2 comes from achieving higher reuse of x in the shared L3 cache without incurring overheads from fill-in of original zeroes. Furthermore, the pre-processing costs of SpMV with CSR-2 can be amortized on average over 97 iterations of SpMV using CSR, substantially fewer than the 513 iterations required for the automatically tuned implementation. Based on these results, CSR-k appears to be a promising multilevel formulation of CSR for adapting sparse computations to multicore processors with NUMA memory hierarchies.
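For readers unfamiliar with the baseline, the flat CSR SpMV kernel that CSR-2 extends looks as follows; CSR-k layers additional index arrays on top of this, grouping rows into super-rows so that rows sharing entries of x are processed by nearby cores. The loop below is the standard kernel, not the paper's CSR-2 code.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A @ x with A in CSR format.

    row_ptr[i]..row_ptr[i+1] delimit row i's non-zeros in the parallel
    arrays col_idx (column indices) and vals (values). Reuse of x in
    cache depends on how rows and their column patterns are ordered,
    which is precisely what CSR-k's extra levels optimize.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

For the 2x2 matrix [[1, 2], [0, 3]], the CSR arrays are row_ptr = [0, 2, 3], col_idx = [0, 1, 1], vals = [1, 2, 3].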
TriKon: A hypervisor aware manycore processor
Rohan Bhalla, Prathmesh Kallurkar, Nitin Gupta, S. Sarangi
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116710
Virtualization is increasingly being deployed to run applications in cloud computing environments. Sadly, there are overheads associated with hypervisors that can prohibitively reduce application performance. A major source of these overheads is the destructive interference between the application, OS, and hypervisor in the memory system. We characterize such overheads in this paper, and propose the design of a novel Triangle cache that can effectively mitigate destructive interference across these three classes of workloads. We subsequently design the TriKon manycore processor, which consists of a set of heterogeneous cores with caches of different sizes, including Triangle caches. To maximize the throughput of the system as a whole, we propose a dynamic algorithm for scheduling a mix of system-intensive and CPU-intensive applications on the heterogeneous cores. The area of the TriKon processor is within 2% of a baseline processor, and with such a system we achieve a performance gain of 12% for a suite of benchmarks. Within this suite, the system-intensive benchmarks show a performance gain of 20% while the performance of the compute-intensive ones remains unaffected. Furthermore, by allocating extra area to cores with sophisticated cache designs, we improve the performance gain of the system-intensive benchmarks to 30%.
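A heterogeneity-aware scheduler of the kind the abstract describes can be caricatured as: rank applications by how system-intensive they are, then steer the most system-intensive ones to the cores with the richer cache design. The greedy sketch below is a hypothetical stand-in; the paper's scheduler is dynamic and measures workload behavior online, and the metric and names here are assumptions.

```python
def schedule(apps, big_cache_cores, small_cache_cores):
    """Greedy assignment of apps to heterogeneous cores.

    apps is a list of (name, os_time_fraction) pairs. Applications
    spending a larger fraction of time in the OS/hypervisor (system
    intensive) benefit most from the larger caches, so they claim
    the big-cache cores first. Assumes enough cores for all apps.
    """
    ranked = sorted(apps, key=lambda a: a[1], reverse=True)
    pools = [list(big_cache_cores), list(small_cache_cores)]
    plan = {}
    for name, _ in ranked:
        pool = pools[0] if pools[0] else pools[1]
        plan[name] = pool.pop(0)
    return plan
```

With one core of each kind, a database-like workload spending 60% of its time in system code would take the big-cache core, leaving the compute-bound matrix kernel on the ordinary core.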
An early experience of regional ocean modelling on Intel Many Integrated Core architecture
Srikanth Yalavarthi, A. Kaginalkar
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116907
Ocean modelling is an inherently complex undertaking within the earth system framework that poses a challenge to earth and computational scientists. Simulating wide temporal and spatial scales in real time or near real time requires computational scientists to explore new performance-enhancing architectures and simulation methods. Owing to the spectrum of scales of motion, the computational requirements of ocean forecasting are higher than those of numerical weather prediction. To some extent, rapidly evolving computer technology provides solutions to this. In this paper, we present initial attempts at porting a high-resolution regional ocean model to an Intel MIC-based hybrid system. We discuss the challenges and issues to be addressed in achieving an efficient implementation of an ocean modelling system.
Pub Date : 2014-12-01DOI: 10.1109/HiPC.2014.7116897
Geetika Malhotra, Seep Goel, S. Sarangi
In this paper, we introduce GpuTejas, a new Java-based parallel GPGPU simulator. GpuTejas is a fast trace-driven simulator that uses relaxed synchronization and non-blocking data structures to derive its speedups. Second, it introduces a novel scheduling and partitioning scheme for parallelizing a GPU simulator. We evaluate the performance of our simulator with a set of Rodinia benchmarks. We demonstrate a mean speedup of 17.33x with 64 threads over sequential execution, and a speedup of 429x over the widely used simulator GPGPU-Sim. We validated our timing and simulation model by comparing our results against a native system (NVIDIA Tesla M2070). Compared to the sequential version of GpuTejas, the parallel version has an error limited to less than 7.67% for our suite of benchmarks, which is similar to the numbers reported by competing parallel simulators.
{"title":"GpuTejas: A parallel simulator for GPU architectures","authors":"Geetika Malhotra, Seep Goel, S. Sarangi","doi":"10.1109/HiPC.2014.7116897","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116897","url":null,"abstract":"In this paper, we introduce a new Java-based parallel GPGPU simulator, GpuTejas. GpuTejas is a fast trace driven simulator, which uses relaxed synchronization, and non-blocking data structures to derive its speedups. Secondly, it introduces a novel scheduling and partitioning scheme for parallelizing a GPU simulator. We evaluate the performance of our simulator with a set of Rodinia benchmarks. We demonstrate a mean speedup of 17.33x with 64 threads over sequential execution, and a speedup of 429X over the widely used simulator GPGPU-Sim. We validated our timing and simulation model by comparing our results with a native system (NVIDIA Tesla M2070). As compared to the sequential version of GpuTejas, the parallel version has an error limited to <;7.67% for our suite of benchmarks, which is similar to the numbers reported by competing parallel simulators.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"74 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127181559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}