2014 IEEE 28th International Parallel and Distributed Processing Symposium: Latest Publications

An Accelerated Recursive Doubling Algorithm for Block Tridiagonal Systems
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.107
S. Seal
Block tridiagonal systems of linear equations arise in a wide variety of scientific and engineering applications. The recursive doubling algorithm is a well-known prefix-computation-based numerical algorithm that requires O(M³(N/P + log P)) work to compute the solution of a block tridiagonal system with N block rows and block size M on P processors. In real-world applications, solutions of tridiagonal systems are most often sought for many different right-hand sides, often hundreds or thousands, but with the same tridiagonal matrix. Here, we show that the recursive doubling algorithm is sub-optimal when computing solutions of block tridiagonal systems with multiple right-hand sides, and we present a novel algorithm, called the accelerated recursive doubling algorithm, that delivers an O(R) improvement when solving block tridiagonal systems with R distinct right-hand sides. Since R is typically ~10² to 10⁴, this improvement translates to very significant speedups in practice. Detailed complexity analyses of the new algorithm are presented, with empirical confirmation of the runtime improvements. To the best of our knowledge, this algorithm has not been reported before in the literature.
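The source of the O(R) saving is easiest to see in a serial, scalar analogue: the work that depends only on the matrix can be done once and reused for every right-hand side, so only the cheap sweeps are repeated R times. The sketch below illustrates that reuse pattern with an ordinary Thomas-style tridiagonal solve; it is our illustration of the reuse idea, not the paper's parallel recursive doubling algorithm, and all names in it are ours.

```python
import numpy as np

def factor_tridiag(a, b, c):
    """Matrix-only work, done once: elimination multipliers w and pivots d
    for a tridiagonal system with sub-diagonal a, diagonal b, super-diagonal c."""
    n = len(b)
    w = np.empty(n - 1)
    d = b.astype(float)
    for i in range(n - 1):
        w[i] = a[i] / d[i]
        d[i + 1] -= w[i] * c[i]
    return w, d

def solve_with_factor(w, d, c, rhs):
    """Cheap per-right-hand-side work that reuses the stored factorization."""
    n = len(d)
    y = rhs.astype(float)
    for i in range(n - 1):               # forward sweep
        y[i + 1] -= w[i] * y[i]
    x = np.empty(n)
    x[-1] = y[-1] / d[-1]
    for i in range(n - 2, -1, -1):       # back substitution
        x[i] = (y[i] - c[i] * x[i + 1]) / d[i]
    return x

# R right-hand sides, one factorization: matrix-dependent cost is paid once.
n, R = 1000, 500
rng = np.random.default_rng(0)
a, c = rng.random(n - 1), rng.random(n - 1)
b = 4.0 + rng.random(n)                  # diagonally dominant, stable pivots
w, d = factor_tridiag(a, b, c)
solutions = [solve_with_factor(w, d, c, rng.random(n)) for _ in range(R)]
```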
Citations: 4
FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.126
Kento Sato, A. Moody, K. Mohror, T. Gamblin, B. Supinski, N. Maruyama, S. Matsuoka
Future supercomputers built with more components will enable larger, higher-fidelity simulations, but at the cost of higher failure rates. Traditional approaches to mitigating failures, such as checkpoint/restart (C/R) to a parallel file system, incur large overheads. On future, extreme-scale systems, it is unlikely that traditional C/R will recover a failed application before the next failure occurs. To address this problem, we present the Fault Tolerant Messaging Interface (FMI), which enables extremely low-latency recovery. FMI accomplishes this using a survivable communication runtime coupled with fast, in-memory C/R and dynamic node allocation. FMI provides message-passing semantics similar to MPI, but applications written using FMI can run through failures. The FMI runtime software handles fault tolerance, including checkpointing application state, restarting failed processes, and allocating additional nodes when needed. Our tests show that FMI runs with failure-free performance similar to MPI's, and that FMI incurs only a 28% overhead even with a very low mean time between failures of 1 minute.
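To make the run-through-failures semantics concrete, here is a minimal single-process sketch of the pattern the abstract describes: keep a cheap in-memory checkpoint and roll back to it when a failure is detected, instead of aborting the whole job. This is a toy illustration, not the FMI API; the function names and the failure-injection rule are our assumptions.

```python
import copy
import random

def run_through_failures(steps, checkpoint_every=10, p_fail=0.05):
    """Advance a computation to `steps`, surviving injected failures by
    rolling back to the latest in-memory checkpoint instead of aborting."""
    state = {"step": 0, "value": 0.0}
    checkpoint = copy.deepcopy(state)
    while state["step"] < steps:
        try:
            if random.random() < p_fail:            # injected "node failure"
                raise RuntimeError("simulated failure")
            state["value"] += 1.0                   # one step of real work
            state["step"] += 1
            if state["step"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)   # fast in-memory C/R
        except RuntimeError:
            state = copy.deepcopy(checkpoint)       # recover, keep running
    return state

print(run_through_failures(100))
```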
Citations: 27
An Improved Router Design for Reliable On-Chip Networks
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.39
Pavan Poluri, A. Louri
Aggressive technology scaling into the deep nanometer regime has made the Network-on-Chip (NoC) in multicore architectures increasingly vulnerable to faults. This has accelerated the need for designing reliable NoCs. To this end, we propose a reliable NoC router architecture capable of tolerating multiple permanent faults. The proposed router achieves better reliability without incurring much area and power overhead compared to the baseline NoC router or other fault-tolerant routers. Reliability analysis using Mean Time to Failure (MTTF) reveals that our proposed router is six times more reliable than the baseline NoC router (without protection). We also compare our proposed router with other existing fault-tolerant routers, such as BulletProof, Vicis, and RoCo, using the Silicon Protection Factor (SPF) as a metric. SPF analysis shows that our proposed router is more reliable than these existing fault-tolerant routers. Hardware synthesis performed with the Cadence Encounter RTL Compiler using a commercial 45nm technology library shows that the correction circuitry incurs an area overhead of 31% and a power overhead of 30%. Latency analysis on a 64-core mesh-based NoC, simulated using GEM5 and running SPLASH-2 and PARSEC benchmark application traffic, shows that in the presence of multiple faults, our proposed router increases overall latency by only 10% and 13%, respectively, while providing better reliability.
Citations: 20
Power and Performance Characterization and Modeling of GPU-Accelerated Systems
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.23
Yukitaka Abe, Hiroshi Sasaki, S. Kato, Koji Inoue, M. Edahiro, M. Peres
Graphics processing units (GPUs) provide an order-of-magnitude improvement in peak performance and performance-per-watt compared to traditional multicore CPUs. However, GPU-accelerated systems currently lack a generalized method of power and performance prediction, which keeps system designers from the ultimate goal of dynamic power and performance optimization. This is because their power and performance characteristics are not well captured across architectures; as a result, existing power and performance modeling approaches are available only for a limited range of particular GPUs. In this paper, we present power and performance characterization and modeling of GPU-accelerated systems across multiple generations of architectures. Characterization and modeling both play a vital role in the optimization and prediction of GPU-accelerated systems. We quantify the impact of voltage and frequency scaling on each architecture, with the particularly intriguing result that a cutting-edge Kepler-based GPU achieves energy savings of 75% by lowering GPU clocks in the best scenario, while Fermi- and Tesla-based GPUs achieve no more than 40% and 13%, respectively. Considering these characteristics, we provide statistical power and performance models of GPU-accelerated systems that are simple enough to be applicable across multiple generations of architectures. One of our findings is that even simplified statistical models are able to predict the power and performance of cutting-edge GPUs within errors of 20% to 30% for any voltage and frequency pair.
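For a flavor of what a "simplified statistical model" of power can look like, the sketch below fits the classic CMOS-style form P ≈ P_static + c·V²·f to a handful of (voltage, frequency, power) samples by least squares, then predicts power at a new operating point. The functional form, the sample values, and all names are our assumptions for illustration, not the authors' calibrated model.

```python
import numpy as np

# Hypothetical (voltage [V], frequency [GHz], measured power [W]) samples.
samples = np.array([
    (0.90, 0.5,  55.0),
    (1.00, 0.7,  80.0),
    (1.05, 0.9, 110.0),
    (1.10, 1.1, 150.0),
])
V, f, P = samples[:, 0], samples[:, 1], samples[:, 2]

# Classic CMOS-style form: P ~ P_static + c * V^2 * f, fitted by least squares.
A = np.column_stack([np.ones_like(V), V**2 * f])
(p_static, c), *_ = np.linalg.lstsq(A, P, rcond=None)

def predict_power(volts, ghz):
    """Predicted power (W) at an unmeasured voltage/frequency pair."""
    return p_static + c * volts**2 * ghz

print(f"predicted power at 1.0 V, 0.8 GHz: {predict_power(1.0, 0.8):.1f} W")
```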
Citations: 60
Algorithmic Time, Energy, and Power on Candidate HPC Compute Building Blocks
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.54
JeeWhan Choi, Marat Dukhan, Xing Liu, R. Vuduc
We conducted a microbenchmarking study of the time, energy, and power of computation and memory access on several existing platforms. These platforms represent candidate compute-node building blocks of future high-performance computing systems. Our analysis uses the "energy roofline" model, developed in prior work, which we extend in two ways. First, we improve the model's accuracy by accounting for power caps, basic memory-hierarchy access costs, and measurements of random memory access patterns. Second, we empirically evaluate server-, mini-, and mobile-class platforms that span a range of compute and power characteristics. Our study includes a dozen such platforms, including x86 (both conventional and Xeon Phi), ARM, GPU, and hybrid (AMD APU and other SoC) processors. These data and our model analytically characterize the range of algorithmic regimes in which we might prefer one building block to the others. The analysis suggests critical values of arithmetic intensity around which some systems may switch from being more to less time- and energy-efficient than others; it further suggests how, with respect to intensity, operations should be throttled to meet a power cap. We hope our methods can help make debates about the relative merits of these and other systems more quantitative, analytical, and insightful.
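The roofline idea the abstract builds on can be written down compactly: runtime is bounded by either compute throughput or memory bandwidth, and energy adds a constant-power term proportional to that runtime. The sketch below renders that form under hypothetical machine constants; it is a minimal sketch of the modeling style, not the paper's calibrated model.

```python
def roofline_time_energy(W, Q, peak_flops=1e12, mem_bw=2e11,
                         e_flop=50e-12, e_byte=200e-12, p_const=20.0):
    """W flops and Q bytes of memory traffic under hypothetical constants:
    time is bounded by compute or bandwidth, whichever is slower, and
    energy adds a constant-power term proportional to that time."""
    T = max(W / peak_flops, Q / mem_bw)          # seconds
    E = W * e_flop + Q * e_byte + p_const * T    # joules
    return T, E

# Arithmetic intensity I = W/Q decides which roofline term dominates.
Q = 1e9
for intensity in (1, 5, 25):
    T, E = roofline_time_energy(intensity * Q, Q)
    print(f"I={intensity:2d} flop/byte: T={T:.4f} s, E={E:.2f} J")
```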
Citations: 76
A Medium-Grain Method for Fast 2D Bipartitioning of Sparse Matrices
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.62
D. Pelt, R. Bisseling
We present a new hypergraph-based method, the medium-grain method, for solving the sparse matrix partitioning problem. This problem arises when distributing data for parallel sparse matrix-vector multiplication. In the medium-grain method, each matrix nonzero is assigned to either a row group or a column group, and these groups are represented by vertices of the hypergraph. For an m×n sparse matrix, the resulting hypergraph has m+n vertices and m+n hyperedges. Furthermore, we present an iterative refinement procedure, based on the medium-grain method, for improving a given partitioning; it can be applied as a cheap but effective postprocessing step after any partitioning method. The medium-grain method is able to produce fully two-dimensional bipartitionings, but its computational complexity equals that of one-dimensional methods. Experimental results for a large set of sparse test matrices show that the medium-grain method with iterative refinement produces bipartitionings with lower communication volume than current state-of-the-art methods, and is faster at producing them.
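A minimal sketch of the construction as we read the abstract: each nonzero (i, j) is placed in either row group r_i or column group c_j, the groups become vertices, and every row and every column contributes one net, giving m+n of each. The assignment rule and all names below are ours, chosen only for illustration, and the details follow our reading of the abstract rather than the authors' code.

```python
from collections import defaultdict

def medium_grain_hypergraph(nonzeros, assign_to_row):
    """Each nonzero (i, j) joins row group ('r', i) or column group ('c', j).
    Groups are the vertices; every row and every column contributes one net,
    so an m-by-n matrix yields m+n vertices and m+n nets."""
    row_net = defaultdict(set)  # net i: groups holding the nonzeros of row i
    col_net = defaultdict(set)  # net j: groups holding the nonzeros of column j
    for i, j in nonzeros:
        v = ("r", i) if assign_to_row(i, j) else ("c", j)
        row_net[i].add(v)
        col_net[j].add(v)
    return row_net, col_net

# Toy 3x3 matrix; send sub-diagonal nonzeros to row groups (an arbitrary rule).
nz = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]
row_nets, col_nets = medium_grain_hypergraph(nz, lambda i, j: i > j)
print(dict(row_nets))
print(dict(col_nets))
```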
Citations: 32
HPMMAP: Lightweight Memory Management for Commodity Operating Systems
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.73
Brian Kocoloski, J. Lange
Linux-based operating systems and runtimes (OS/Rs) have emerged as the environments of choice for the majority of modern HPC systems. While Linux-based OS/Rs have advantages such as extensive feature sets and developer familiarity, these features come at the cost of additional overhead throughout the system. In contrast to Linux, there is a substantial history of work in the HPC community focused on lightweight kernel (LWK) OS/R architectures that provide scalable and consistent performance for tightly coupled HPC applications but lack many of the features offered by commodity OS/Rs. In this paper, we propose to bridge the gap between LWKs and commodity OS/Rs by selectively providing a lightweight memory subsystem for HPC applications in a commodity OS/R environment. Our system, HPMMAP, transparently provides isolated, low-overhead memory performance to HPC applications by bypassing Linux's memory management layer. Our approach is dynamically configurable at runtime, and it adds no overhead and requires no resources when not in use. We show that HPMMAP can decrease variance and reduce application runtime by up to 50%.
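One way to picture "bypassing the kernel's memory management layer" is an arena that is mapped once and then sub-allocated entirely in user space, so the per-allocation hot path is pure bookkeeping with no further calls into the kernel allocator. The bump-pointer sketch below illustrates only that general idea; it is not HPMMAP's mechanism, and every name in it is ours.

```python
import mmap

class ArenaAllocator:
    """Map one large anonymous region up front, then serve allocations with a
    bump pointer: the per-allocation hot path is pure user-space bookkeeping."""
    def __init__(self, size=1 << 26):        # one 64 MiB mapping, made once
        self.buf = mmap.mmap(-1, size)       # anonymous mapping
        self.size = size
        self.offset = 0

    def alloc(self, nbytes):
        if self.offset + nbytes > self.size:
            raise MemoryError("arena exhausted")
        start = self.offset
        self.offset += nbytes
        return memoryview(self.buf)[start:start + nbytes]

arena = ArenaAllocator()
block = arena.alloc(4096)
block[:5] = b"hello"                         # write through the arena view
```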
Citations: 12
Active Measurement of Memory Resource Consumption
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.105
Marc Casas, G. Bronevetsky
Hierarchical memory is a cornerstone of modern hardware design because it provides high memory performance and capacity at a low cost. However, the use of multiple levels of memory and complex cache management policies makes it very difficult to optimize the performance of applications running on hierarchical memories. As the number of compute cores per chip continues to rise faster than the total amount of available memory, applications will become increasingly starved for memory storage capacity and bandwidth, making the problem of performance optimization even more critical. We propose a new methodology for measuring and modeling the performance of hierarchical memories in terms of the application's utilization of the key memory resources: capacity of a given memory level and bandwidth between two levels. This is done by actively interfering with the application's use of these resources. The application's sensitivity to reduced resource availability is measured by observing the effect of interference on application performance. The resulting resource-oriented model of performance both greatly simplifies application performance analysis and makes it possible to predict an application's performance when running with various resource constraints. This is useful to predict performance for future memory-constrained architectures.
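The measurement idea, deliberately degrading a resource and watching how much the application cares, can be mimicked in-process: run the workload once alone and once next to a thread that streams over a large array to consume memory bandwidth, then report the slowdown ratio. The sketch below is a toy analogue of that methodology (NumPy releases the GIL inside array operations, so the threads genuinely compete for bandwidth); it is not the authors' instrumentation, and all names are ours.

```python
import threading
import time
import numpy as np

def workload():
    # Memory-traffic-heavy loop standing in for the application under study.
    a = np.random.rand(2_000_000)
    for _ in range(50):
        a = a * 1.0001 + 0.5

def interferer(stop, size=8_000_000):
    # Streams over a large array to consume memory bandwidth until told to stop.
    hog = np.random.rand(size)
    while not stop.is_set():
        hog.sum()

def timed_run(with_interference):
    stop = threading.Event()
    if with_interference:
        threading.Thread(target=interferer, args=(stop,), daemon=True).start()
    t0 = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - t0
    stop.set()
    return elapsed

base = timed_run(False)
loaded = timed_run(True)
print(f"slowdown under bandwidth interference: {loaded / base:.2f}x")
```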
Citations: 19
MapReuse: Reusing Computation in an In-Memory MapReduce System
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.18
Devesh Tiwari, Yan Solihin
The MapReduce programming model is being increasingly adopted for data-intensive high-performance computing. Recently, it has been observed that in data-intensive environments, programs are often run multiple times with either identical or slightly changed input, which creates a significant opportunity for computation reuse. Recognizing this opportunity, researchers have proposed techniques to reuse computation in disk-based MapReduce systems such as Hadoop, but not for in-memory MapReduce (IMMR) systems such as Phoenix. In this paper, we propose a novel technique for computation reuse in IMMR systems, which we refer to as MapReuse. MapReuse detects input similarity by comparing signatures. It skips re-computing output for a repeated portion of the input, computes output for a new portion of the input, and removes output that corresponds to a deleted portion of the input. MapReuse is built on top of an existing IMMR system, leaving it largely unmodified. MapReuse significantly speeds up IMMR, even when the new input differs by 25% from the original input.
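A minimal sketch of signature-based reuse: split the input into chunks, key each chunk's map output by a content hash, and recompute only the chunks whose signatures are new. The chunking granularity, hash choice, and names are our assumptions for illustration, not the MapReuse implementation.

```python
import hashlib

def map_with_reuse(chunks, map_fn, cache):
    """Key each chunk's map output by a content hash: repeated chunks are
    served from the cache, new chunks are computed, deleted chunks drop out."""
    results = []
    for chunk in chunks:
        sig = hashlib.sha256(chunk.encode()).hexdigest()
        if sig not in cache:          # new portion of the input
            cache[sig] = map_fn(chunk)
        results.append(cache[sig])    # repeated portion: reused, not recomputed
    return results

cache = {}
run1 = map_with_reuse(["alpha", "beta", "gamma"], str.upper, cache)
run2 = map_with_reuse(["alpha", "beta", "delta"], str.upper, cache)
print(run2)  # only "delta" was recomputed on the second run
```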
Citations: 15
BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.89
Reza Mokhtari, M. Stumm
GPUs offer an order of magnitude higher compute power and memory bandwidth than CPUs. GPUs therefore might appear to be well suited to accelerating computations that operate on voluminous data sets in independent ways, e.g., for transformations, filtering, aggregation, partitioning, or other "Big Data"-style processing. Yet experience indicates that it is difficult, and often error-prone, to write GPGPU programs that efficiently process data that does not fit in GPU memory, partly because of the intricacies of GPU hardware architecture and programming models, and partly because of the limited bandwidth available between GPUs and CPUs. In this paper, we propose BigKernel, a scheme that provides pseudo-virtual memory to GPU applications and is implemented using a 4-stage pipeline with automated prefetching to (i) optimize CPU-GPU communication and (ii) optimize GPU memory accesses. BigKernel simplifies the programming model by allowing programmers to write kernels using arbitrarily large data structures that can be partitioned into segments, where each segment is operated on independently; these kernels are transformed into BigKernel form using straightforward compiler transformations. Our evaluation on six data-intensive benchmarks shows that BigKernel achieves an average speedup of 1.7 over state-of-the-art double-buffering techniques and an average speedup of 3.0 over corresponding multi-threaded CPU implementations.
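The heart of such a pipeline is overlap: while segment k is being computed on, segment k+1 is already being staged in. The two-stage sketch below mimics that overlap with a bounded queue and a producer thread standing in for the CPU-GPU transfer engine; it is a toy illustration of pipelined double buffering, not BigKernel's 4-stage pipeline, and all names are ours.

```python
import queue
import threading

def pipelined(segments, transfer, compute, depth=2):
    """Producer 'transfers' segment k+1 while the consumer computes on
    segment k; the bounded queue acts as the double buffer between stages."""
    q = queue.Queue(maxsize=depth)

    def producer():
        for seg in segments:
            q.put(transfer(seg))      # stage 1: stage the next segment in
        q.put(None)                   # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (staged := q.get()) is not None:
        results.append(compute(staged))  # stage 2: compute, overlapped
    return results

segments = [list(range(i, i + 4)) for i in range(0, 16, 4)]
print(pipelined(segments, transfer=lambda s: s, compute=sum))
```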
Citations: 25